Natural Language – Software engineer tech blog

transformer

import numpy as np 

# 3 words, each a 4-dimensional vector
X = np.array([
    [1, 0, 1, 0],
    [0, 2, 0, 2],
    [1, 1, 1, 1]
])

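# Random projection matrices (4 -> 2 dimensions) for queries, keys and values
W_q = np.random.rand(4, 2)
W_k = np.random.rand(4, 2)
W_v = np.random.rand(4, 2)

# Project the inputs into query, key and value space
Q = X @ W_q
K = X @ W_k
V = X @ W_v

# Raw attention scores: dot product of every query with every key
attention_scores = Q @ K.T

# Scale by sqrt(d_k), the query/key dimension
dk = Q.shape[-1]
attention_scores = attention_scores / np.sqrt(dk)

def softmax(x):
    # Subtract the row maximum before exponentiating for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

# Each row of weights sums to 1
attention_weights = softmax(attention_scores)

# Output: weighted sum of the value vectors
output = attention_weights @ V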

print("Input X:\n", X)
print("\nAttention Weights:\n", attention_weights)
print("\nOutput:\n", output)
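To follow the arithmetic by hand, here is a minimal dependency-free sketch of the same steps (scores, scaling by sqrt(d_k), row-wise softmax, weighted sum of V). The function name `attention` and the small hand-written Q, K, V matrices are assumptions for illustration; they are not the random projections used above.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(Q[0])
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is the weighted sum of the value rows
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return weights, output

# Hypothetical toy values chosen so the result is easy to check
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

weights, output = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
print(output)
```

Each query attends most strongly to the key it matches, and every row of weights sums to 1, which is the property the softmax step is there to guarantee.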

[ChatGPT] Sending a request from PHP

// Read the JSON request body and pull out the prompt (empty when run from the CLI)
$input_json = file_get_contents('php://input');
$post = json_decode($input_json, true);
$req_question = $post['prompt'] ?? '';

$result = array();

// API key
$apiKey = '***';

// OpenAI API endpoint
$endpoint = 'https://api.openai.com/v1/chat/completions';

$headers = array(
  'Content-Type: application/json',
  'Authorization: Bearer ' . $apiKey
);

// Request payload
$data = array(
  'model' => 'gpt-3.5-turbo',
  'messages' => [
    [
    "role" => "system",
    "content" => "新宿はどんな所ですか？" // "What kind of place is Shinjuku?"
    ],
    // [
    // "role" => "user",
    // "content" => $req_question
    // ]
  ]
);

// Initialize the cURL request
$ch = curl_init();

// Set the cURL options
curl_setopt($ch, CURLOPT_URL, $endpoint);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($data));

// Send the request to the API
$response = curl_exec($ch);

// Close the cURL handle
curl_close($ch);

// Parse the response
$result = json_decode($response, true);

// Extract the generated text
$text = $result['choices'][0]['message']['content'];

var_dump($result);

A response does come back, but it takes a while.
$ php index.php
array(6) {
  ["id"]=>
  string(38) "chatcmpl-7jHpzpu5LnjqM2dbxoBQrx0Nrp0DQ"
  ["object"]=>
  string(15) "chat.completion"
  ["created"]=>
  int(1691027683)
  ["model"]=>
  string(18) "gpt-3.5-turbo-0613"
  ["choices"]=>
  array(1) {
    [0]=>
    array(3) {
      ["index"]=>
      int(0)
      ["message"]=>
      array(2) {
        ["role"]=>
        string(9) "assistant"
        ["content"]=>
        string(1323) "新宿は東京都内で最も繁華なエリアの一つです。駅周辺には高層ビルが立ち並び、多くの人が行き交います。新宿駅は日本でも最も利用者が多い駅の一つで、多くの鉄道路線が交差し、バスターミナルもあるため、アクセスが非常に便利です。

新宿には大型商業施設やデパート、ショッピングモールが集まっており、様々なショップやレストランがあります。また、歌舞伎町という繁華街もあり、夜になると多くの人々で賑わいます。歓楽街として知られており、多くの居酒屋、バー、クラブがあります。

また、新宿は文化施設も充実しており、新宿御苑や東京都庁舎、新宿中央公園などの公共の場所で自然に触れることもできます。さらに、新宿の西側には高層ビルが連なる都市の景色が楽しめる新宿西口地区もあります。

新宿はまた、交通の要所としても知られており、多くの人々が通勤や買い物などで訪れます。そのため、駅周辺は常に混雑していることが多いですが、多くの施設やイベントが盛り上がっているため、観光客や地元の人々にとっても魅力的な場所です。"
      }
      ["finish_reason"]=>
      string(4) "stop"
    }
  }
  ["usage"]=>
  array(3) {
    ["prompt_tokens"]=>
    int(18)
    ["completion_tokens"]=>
    int(490)
    ["total_tokens"]=>
    int(508)
  }
}
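The dump above shows where the useful parts of the response live: the reply text under choices[0].message.content, the stop reason under finish_reason, and the token counts under usage. As a rough sketch of pulling those out (in Python rather than PHP, to match the other posts on this page), something like the following could work. The helper name `summarize_response` and the trimmed sample JSON are assumptions for illustration; only the field names are taken from the dump.

```python
import json

# Trimmed, hypothetical sample shaped like the dump above
sample = '''
{
  "id": "chatcmpl-example",
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "(generated text)"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 18, "completion_tokens": 490, "total_tokens": 508}
}
'''

def summarize_response(raw):
    """Return the reply text, finish reason and total tokens from a response body."""
    data = json.loads(raw)
    choices = data.get("choices") or []
    if not choices:
        # Failed requests typically carry an "error" object instead of "choices"
        raise ValueError(f"no choices in response: {data.get('error')}")
    first = choices[0]
    return {
        "text": first["message"]["content"],
        "finish_reason": first.get("finish_reason"),
        "total_tokens": data.get("usage", {}).get("total_tokens"),
    }

info = summarize_response(sample)
print(info)
```

Checking for a missing choices key before indexing avoids a confusing error when the API key or the payload is wrong.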

Trying out ChatGPT

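$ pip install "openai<1"

import openai

openai.api_key = '***'

# openai.ChatCompletion is the interface of the pre-1.0 openai package
response = openai.ChatCompletion.create(
	model="gpt-3.5-turbo",
	messages=[
		{"role": "system", "content": "You are an assistant that knows a lot about animals."},
		{"role": "user", "content": "Tell me about elephants."},
	]
)
print(response['choices'][0]['message']['content'])

Hmm, somehow this didn't go well for me. Watch the spelling: the package is openai, not openapi, and the model name is gpt-3.5-turbo.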

[NLP] Using the Japanese corpus with chatterbot

$ git clone https://github.com/gunthercox/chatterbot-corpus.git
$ cp -r chatterbot-corpus/chatterbot_corpus/data/japanese /home/vagrant/.local/lib/python3.8/site-packages/chatterbot_corpus/data

from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer

bot = ChatBot("チャットボット試作品")

# Train on the Japanese corpus copied into site-packages above
trainer = ChatterBotCorpusTrainer(bot)
trainer.train("chatterbot.corpus.japanese")

# Read a line from stdin, print the bot's reply, repeat until Ctrl-C or EOF
while True:
	try:
		response = bot.get_response(input())
		print(response)
	except (KeyboardInterrupt, EOFError, SystemExit):
		break

出身はどちらですか
私はすべてのソフトウェアプログラムの出所です。遠く離れた銀河。
こんにちは
こちらこそこんにちは
宜しくお願いします
a
ところで
か
今日は何の日ですか？
不明な変数名。
疲れたー
わかります。
プログラミングは面白い
橋
何だって？
ピーターパークカー。
何？
それはあなたが一度だけ生きていることを意味します。あなたはそれをどこで聞いたのですか？
いいえ
顕著ではない。
その通り。
どの言語を使いたいですか？
日本語
0と1以外のすべての数字のように、多くのもの。
何のこと？
関心度
関心度
それはすべて中央銀行の行動に依存します。
経済やん
はい、そうです。
そうなのか
ロボットは嘘をついてはいけません
よくわからん
ま
まあ
株式市場

Hmm, this isn't a conversation at all…

[NLP] Creating summaries with sumy

$ pip3 install sumy

import MeCab

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

text = ""  # put the Japanese text to summarize here

def summy_test(text):
    # Segment the text into words with MeCab.
    # (This word list is not passed to sumy, which tokenizes on its own below.)
    tagger = MeCab.Tagger()
    key = tagger.parse(text)
    corpus = []
    for row in key.split("\n"):
        word = row.split("\t")[0]
        if word == "EOS":
            break
        corpus.append(word)

    # Parse the raw text with sumy's Japanese tokenizer
    parser = PlaintextParser.from_string(text, Tokenizer('japanese'))

    # Pick 2 sentences with LexRank
    summarizer = LexRankSummarizer()
    summarizer.stop_words = ['']
    summary = summarizer(document=parser.document, sentences_count=2)
    return "".join(str(sentence) for sentence in summary)

print(summy_test(text))

That was easy.

[NLP] Sentiment analysis with Python and transformers

$ pip3 install fugashi
$ pip3 install ipadic

sentiment.py

# -*- coding: utf-8 -*-
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("daigo/bert-base-japanese-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("daigo/bert-base-japanese-sentiment")

# Reuse the tokenizer and model loaded above instead of loading them again by name
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(nlp("私は幸福である。"))  # "I am happy."

$ python3 sentiment.py
[{'label': 'ポジティブ', 'score': 0.9843042492866516}]

Single words and neutral sentences tend to be classified as positive (ポジティブ).

Change the input to "もうダメだ" ("It's hopeless") and run it again.
$ python3 sentiment.py
[{'label': 'ネガティブ', 'score': 0.9892264604568481}]

I see.
The long processing time is the sore spot, though.
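Since neutral text tends to come back as positive, one possible workaround is to treat low-confidence results as neutral. The sketch below is only an idea, not something the model provides: the helper name `label_with_neutral` and the 0.9 threshold are assumptions, and the third sample result is made up to show the neutral case.

```python
def label_with_neutral(result, threshold=0.9):
    """Return the pipeline label, or "neutral" when the score is below the threshold."""
    if result["score"] < threshold:
        return "neutral"
    return result["label"]

results = [
    {"label": "ポジティブ", "score": 0.9843042492866516},  # from the first run above
    {"label": "ネガティブ", "score": 0.9892264604568481},  # from the second run above
    {"label": "ポジティブ", "score": 0.62},                # hypothetical low-confidence result
]

labels = [label_with_neutral(r) for r in results]
print(labels)
```

The right threshold would have to be tuned on real inputs; 0.9 is just a starting guess.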

[NLP] Searching for similar words with word2vec

$ pip3 install gensim==3.8.1

### Download the word2vec model file
$ wget http://public.shiroyagi.s3.amazonaws.com/latest-ja-word2vec-gensim-model.zip

$ unzip latest-ja-word2vec-gensim-model.zip

app.py

# -*- coding: utf-8 -*-
from gensim.models import word2vec

model = word2vec.Word2Vec.load('word2vec.gensim.model')
results = model.wv.most_similar(positive=['日本'])
for result in results:
	print(result)

$ python3 app.py
('韓国', 0.7088127732276917)
('台湾', 0.6461570262908936)
('日本国内', 0.6403890252113342)
('欧米', 0.6350583434104919)
('日本国外', 0.6200590133666992)
('台湾出身', 0.6174061894416809)
('中華圏', 0.612815260887146)
('日本の経済', 0.6088820099830627)
('日本の歴史', 0.6070738434791565)
('韓国国内', 0.6054152250289917)

Note that gensim raises an error here unless the version is pinned.
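The numbers next to each word are cosine similarities between word vectors. A minimal sketch of that ranking without gensim, using made-up 3-dimensional vectors (the names `cosine_similarity`, `most_similar` and the toy vectors A to D are assumptions, not the real model data):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "word vectors"
vectors = {
    "A": [1.0, 2.0, 0.0],
    "B": [2.0, 4.0, 0.0],  # same direction as A
    "C": [0.0, 0.0, 3.0],  # orthogonal to A
    "D": [1.0, 1.0, 1.0],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(w, cosine_similarity(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("A", vectors)
print(ranking)
```

Vectors pointing the same way score close to 1 regardless of their length, which is why the measure works for comparing words.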

OK.
Next I'll start connecting these pieces together.

[NLP] Automatic text generation with Python

This uses transformers.
$ pip3 install transformers==4.3.3 torch==1.8.0 sentencepiece==0.1.91

# -*- coding: utf-8 -*-
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja")
model = transformers.AutoModelForCausalLM.from_pretrained("colorfulscoop/gpt2-small-ja")

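# Encode the prompt ("Once upon a time, in a certain place")
input_ids = tokenizer.encode("昔々あるところに", return_tensors="pt")
# Sample 3 continuations with top-k / top-p (nucleus) sampling
output = model.generate(input_ids, do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)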

print(tokenizer.batch_decode(output))

['昔々あるところには、お祭りの女神さんが現れ、そこでお姫様の姫様', '昔々あるところに、ある。ある夏の日、彼は旅人と出会い、その目的がどう', '昔々あるところに、一億年も前には人間たちが住んでいた。いまや、それはこの']

Whoa, it looks like a lot could be done with this…
though I still need to think about the processing in between.
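The generate call above samples with top_k=50 and top_p=0.95. Roughly, top-k keeps only the k most likely next tokens, and top-p then keeps the smallest prefix of those whose cumulative probability reaches p. Here is a simplified sketch on a plain probability table; the function name `filter_top_k_top_p` and the toy distribution are assumptions, and the real library filters logits rather than probabilities.

```python
import random

def filter_top_k_top_p(probs, top_k=50, top_p=0.95):
    """Return a renormalized {token: prob} dict after top-k, then top-p filtering."""
    # Top-k: keep the k most probable tokens and renormalize
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(t, p / total) for t, p in ranked]
    # Top-p: keep tokens until the cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {t: p / total for t, p in kept}

# Hypothetical next-token distribution
probs = {"に": 0.5, "は": 0.3, "で": 0.15, "の": 0.04, "を": 0.01}
filtered = filter_top_k_top_p(probs, top_k=4, top_p=0.9)
print(filtered)

# Sample from what is left; the seed only makes the run repeatable
rng = random.Random(0)
tokens, weights = zip(*filtered.items())
samples = rng.choices(tokens, weights=weights, k=3)
print(samples)
```

Cutting off the unlikely tail is what keeps sampled text varied without letting it drift into nonsense tokens.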

[NLP] Sentiment analysis with Python and asari

$ pip3 install asari

from asari.api import Sonar

sonar = Sonar()
sonar.ping(text="広告が多すぎる♡")

$ python3 asari.py
Traceback (most recent call last):
  File "asari.py", line 1, in <module>
    from asari.api import Sonar
  File "/home/vagrant/dev/nlp/asari.py", line 1, in <module>
    from asari.api import Sonar
ModuleNotFoundError: No module named 'asari.api'; 'asari' is not a package

----
Requirement already satisfied: joblib>=0.11 in /home/vagrant/.local/lib/python3.8/site-packages (from scikit-learn>=0.19.1->asari) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/vagrant/.local/lib/python3.8/site-packages (from scikit-learn>=0.19.1->asari) (2.2.0)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7.3->pandas>=0.22.0->asari) (1.14.0)
Installing collected packages: pytz, pandas, Janome, asari
Successfully installed Janome-0.4.1 asari-0.0.4 pandas-1.3.4 pytz-2021.3

Does it only run on Python 2?
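One thing worth checking with an error like "'asari' is not a package": the traceback points at /home/vagrant/dev/nlp/asari.py, so the script itself is being imported in place of the installed library. A small sketch for seeing what a module name resolves to (the helper name `describe_module` is an assumption; `importlib.util.find_spec` is standard library):

```python
import importlib.util

def describe_module(name):
    """Return (origin, is_package) for a top-level module name, or (None, False)."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None, False
    # Packages have submodule search locations; single-file modules do not
    is_package = spec.submodule_search_locations is not None
    return spec.origin, is_package

# "json" is a standard-library package, used here only as a stand-in
origin, is_package = describe_module("json")
print(origin, is_package)

# A name that does not exist resolves to nothing
missing = describe_module("no_such_module_xyz")
print(missing)
```

Running `describe_module("asari")` from the script's directory would show whether the name resolves to a local asari.py (not a package) or to the installed package; renaming the script avoids the clash.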

I'll retry in a Python 2 environment.
$ pip install scikit-learn==0.20.4
$ pip install Janome==0.3.7

[{'text': '広', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.28345109397418017}, {'class_name': 'positive', 'confidence': 0.7165489060258198}]}, {'text': '告', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.3450436370027103}, {'class_name': 'positive', 'confidence': 0.6549563629972897}]}, {'text': 'が', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9006654377630458}, {'class_name': 'positive', 'confidence': 0.09933456223695437}]}, {'text': '多', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9364137330979464}, {'class_name': 'positive', 'confidence': 0.06358626690205357}]}, {'text': 'す', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.2124141523260128}, {'class_name': 'positive', 'confidence': 0.7875858476739873}]}, {'text': 'ぎ', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.5383816766180572}, {'class_name': 'positive', 'confidence': 0.4616183233819428}]}, {'text': 'る', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.6881484923868434}, {'class_name': 'positive', 'confidence': 0.3118515076131566}]}]

# -*- coding: utf-8 -*-
from asari.api import Sonar

sonar = Sonar()
text = "広告多すぎる♡"  # "Way too many ads♡"
res = sonar.ping(text=text)
print(res)

$ python3 app.py
{'text': '広告多すぎる♡', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9086981552962491}, {'class_name': 'positive', 'confidence': 0.0913018447037509}]}

I see: it works once the Janome and scikit-learn versions are pinned.

[SnowNLP] Chinese NLP with Python 3

$ pip3 install snownlp

tokenization

from snownlp import SnowNLP

s = SnowNLP(u'今天是周六。')
print(s.words)

$ python3 snow.py
['今天', '是', '周六', '。']

With part-of-speech tagging, each word is labeled as a noun, adverb, verb, adjective, and so on.

print(list(s.tags))

$ python3 snow.py
[('今天', 't'), ('是', 'v'), ('周六', 't'), ('。', 'w')]

pinyin

print(s.pinyin)

$ python3 snow.py
['jin', 'tian', 'shi', 'zhou', 'liu', '。']

sentences

s = SnowNLP(u'在茂密的大森林里，一只饥饿的老虎逮住了一只狐狸。老虎张开大嘴就要把狐狸吃掉。"慢着"！狐狸虽然很害怕但还是装出一副很神气的样子说，"你知道我是谁吗？我可是玉皇大帝派来管理百兽的兽王，你要是吃了我，玉皇大帝是决不会放过你的"。')
print(s.sentences)

['在茂密的大森林里', '一只饥饿的老虎逮住了一只狐狸', '老虎张开大嘴就要把狐狸吃掉', '"慢着"', '狐狸虽然很害怕但还是装出一副很神气的样子说', '"你知道我是谁吗', '我可是玉皇大帝派来管理百兽的兽王', '你要是吃了我', '玉皇大帝是决不会放过你的"']

keyword

print(s.keywords(5))

$ python3 snow.py
['狐狸', '大', '老虎', '大帝', '皇']

summary

print(s.summary(3))

['老虎张开大嘴就要把狐狸吃掉', '我可是玉皇大帝派来管理百兽的兽王', '玉皇大帝是决不会放过你的"']

sentiment analysis

text = SnowNLP(u'这个产品很好用，这个产品不好用，这个产品是垃圾，这个也太贵了吧，超级垃圾，是个垃圾中的垃圾')
sent = text.sentences
for sen in sent:
	s = SnowNLP(sen)
	print(s.sentiments)

$ python3 snow.py
0.7853504415636449
0.5098208142944668
0.13082804652201174
0.5
0.0954842128485538
0.04125325276132508

The score ranges from 0 to 1: the closer to 1, the more positive; the closer to 0, the more negative.
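To turn those scores into labels, one simple option is a band around 0.5 that counts as neutral. This is only a sketch: the helper name `to_label` and the 0.1 band width are assumptions, and the scores are the six values printed above.

```python
def to_label(score, band=0.1):
    """Label a 0..1 sentiment score; scores within `band` of 0.5 count as neutral."""
    if score >= 0.5 + band:
        return "positive"
    if score <= 0.5 - band:
        return "negative"
    return "neutral"

# The six scores printed by snow.py above
scores = [
    0.7853504415636449,
    0.5098208142944668,
    0.13082804652201174,
    0.5,
    0.0954842128485538,
    0.04125325276132508,
]

labels = [to_label(s) for s in scores]
print(labels)
```

With this band, the second and fourth clauses come out neutral, the first positive, and the rest negative.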