Genre classification of English text with fastText (Text Classification)

### Getting and preparing data
$ wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
$ ls
cooking.stackexchange.id  cooking.stackexchange.txt  readme.txt
$ head cooking.stackexchange.txt
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?

The `__label__` prefix is how fastText distinguishes labels from ordinary words.

$ wc cooking.stackexchange.txt
15404 169582 1401900 cooking.stackexchange.txt
-> split into training and validation sets
$ head -n 12404 cooking.stackexchange.txt > cooking.train
$ tail -n 3000 cooking.stackexchange.txt > cooking.valid
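The same split can be done in Python; a minimal sketch using only the standard library (`split_dataset` is a hypothetical helper, and the line count matches the `head`/`tail` commands above):

```python
# Split a fastText-formatted file: the first n_train lines go to the
# training file, the remainder to the validation file.
def split_dataset(src, train_path, valid_path, n_train=12404):
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_train])
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(lines[n_train:])
    return n_train, len(lines) - n_train
```

`split_dataset("cooking.stackexchange.txt", "cooking.train", "cooking.valid")` reproduces the shell commands.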

### train_supervised
training.py

import fasttext
model = fasttext.train_supervised(input="cooking.train")

$ python3 training.py
Read 0M words
Number of words: 14543
Number of labels: 735
Floating point exception

Why?!

Category classification with fastText

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
$ pip3 install cython
$ pip3 install fasttext

$ pip3 install requests requests_oauthlib

Get the consumer key, consumer secret, access token, and access token secret from the Twitter developer page.

tweet_get.py

import re
import json
import MeCab
from requests_oauthlib import OAuth1Session

CK = ""
CS = ""
AT = ""
AS = ""

API_URL = "https://api.twitter.com/1.1/search/tweets.json?tweet_mode=extended"
KEYWORD = "芸能 OR アニメ OR 漫画 OR TV OR ゲーム"
CLASS_LABEL = "__label__1"

def main():
    tweets = get_tweet()
    surfaces = get_surfaces(tweets)
    write_txt(surfaces)

def get_tweet():
    params = {'q': KEYWORD, 'count': 20}
    twitter = OAuth1Session(CK, CS, AT, AS)
    req = twitter.get(API_URL, params=params)
    results = []
    if req.status_code == 200:
        tweets = json.loads(req.text)
        for tweet in tweets['statuses']:
            results.append(tweet['full_text'])
    else:
        print("Error: %d" % req.status_code)
    return results  # empty list on error, so main() does not crash

def get_surfaces(contents):
    # tokenize each tweet with MeCab and collect the surface forms
    results = []
    for row in contents:
        content = format_text(row)
        tagger = MeCab.Tagger('')
        tagger.parse('')  # workaround for a known MeCab bug with parseToNode
        surf = []
        node = tagger.parseToNode(content)
        while node:
            surf.append(node.surface)
            node = node.next
        results.append(surf)
    return results

def write_txt(contents):
    try:
        if len(contents) > 0:
            fileName = CLASS_LABEL + ".txt"
            # fastText expects the label and the tokens to be space-separated
            labelText = CLASS_LABEL + " "

            f = open(fileName, 'a')
            for row in contents:
                spaceTokens = " ".join(row)
                result = labelText + spaceTokens + "\n"
                f.write(result)
            f.close()

        print("%d lines written" % len(contents))

    except Exception as e:
        print("failed to write the text file")
        print(e)

def format_text(text):
    # strip URLs, @mentions, HTML entities, retweet markers, and newlines
    text = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
    text = re.sub(r'@[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
    text = re.sub(r'&[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
    text = re.sub(';', "", text)
    text = re.sub('RT', "", text)
    text = re.sub('\n', " ", text)
    return text

if __name__ == '__main__':
    main()
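As a quick sanity check, the cleaning in `format_text` can be exercised on its own (a standalone sketch that repeats the same regexes, with a made-up sample tweet):

```python
import re

def format_text(text):
    # same cleaning as in tweet_get.py: URLs, mentions, entities, RT, newlines
    text = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
    text = re.sub(r'@[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
    text = re.sub(r'&[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
    text = re.sub(';', "", text)
    text = re.sub('RT', "", text)
    text = re.sub('\n', " ", text)
    return text

cleaned = format_text("RT @user: check this https://t.co/abc123\ngreat game")
# the URL, the mention, and the RT marker are gone; the newline became a space
```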

Collect tweets with keywords for three categories: 1. entertainment, 2. beauty, 3. home/living.

__label__1: entertainment
-> 芸能 OR アニメ OR 漫画 OR TV OR ゲーム
__label__2: beauty
-> 肌 OR ヨガ OR 骨盤 OR ウィッグ OR シェイプ
__label__3: home/living
-> リフォーム OR 住宅 OR 家事 OR 収納 OR 食材

Concatenate the label 1–3 text files:
$ cat __label__1.txt __label__2.txt __label__3.txt > model.txt
$ ls
__label__1.txt __label__3.txt model.txt
__label__2.txt fastText tweet_get.py

learning.py

import sys
import fasttext as ft

argvs = sys.argv
input_file = argvs[1]
output_file = argvs[2]

classifier = ft.supervised(input_file, output_file)

$ python3 learning.py model.txt model
raise Exception("`supervised` is not supported any more. Please use `train_supervised`. For more information please refer to https://fasttext.cc/blog/2019/06/25/blog-post.html#2-you-were-using-the-unofficial-fasttext-module")
Exception: `supervised` is not supported any more. Please use `train_supervised`.

model = ft.train_supervised(input=input_file)
model.save_model(output_file + ".bin")

$ python3 learning.py model.txt ftmodel
Read 0M words
Number of words: 6
Number of labels: 135
Floating point exception

Strange — Number of labels should be 3, not 135... This happens when the label is glued to the tweet with a comma and the tokens are joined without spaces: fastText treats any whitespace-separated token that starts with `__label__` as a label, so each whole line becomes one unique "label" and almost no words are left.
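A small sketch makes this failure mode visible (`count_labels_and_words` is a hypothetical helper mimicking how fastText's input parser splits a training line):

```python
# Count labels and words the way fastText's parser does: split on
# whitespace; tokens starting with "__label__" are labels, the rest words.
def count_labels_and_words(line):
    tokens = line.split()
    labels = [t for t in tokens if t.startswith("__label__")]
    words = [t for t in tokens if not t.startswith("__label__")]
    return len(labels), len(words)

# comma-joined, no spaces between tokens: the whole line is one "label"
bad = "__label__1,今日のアニメは面白かった"
# space-separated label and tokens: one label, three words
good = "__label__1 今日 の アニメ"
```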

What is TfIdf?

### TfIdf
Tf = Term Frequency
Idf = Inverse Document Frequency

Tf is how often a term appears within a document; Idf is the inverse of how many documents across the whole corpus contain the term.
With TfIdf, terms that appear in many documents are down-weighted, while terms that appear many times in a particular document are treated as important.
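As an illustration, a hand-rolled computation on a toy corpus (`tf_idf` is a hypothetical helper using the plain, unsmoothed formula; sklearn's `TfidfVectorizer` below uses a slightly different smoothed variant):

```python
import math

# tf-idf of a term in one document of a corpus:
#   tf  = occurrences of term in the document / total terms in the document
#   idf = log(number of documents / number of documents containing the term)
def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "cat"],
]
# "the" appears in every document, so idf = log(3/3) = 0 and its score is 0;
# "cat" is frequent in the third document but appears in only 2 of 3 documents
```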

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# _split_to_words is a tokenizer function (e.g. MeCab-based) passed as the analyzer
def get_vector_by_text_list(_items):
    count_vect = TfidfVectorizer(analyzer=_split_to_words)
    # count_vect = CountVectorizer(analyzer=_split_to_words)
    box = count_vect.fit_transform(_items)
    X = box.todense()
    return [X, count_vect]

MLPClassifier is short for Multi-layer Perceptron classifier: a feed-forward neural-network classification model (`sklearn.neural_network.MLPClassifier` in scikit-learn).
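A minimal sketch of wiring TF-IDF vectors into `MLPClassifier` (the toy English "tweets", labels, and the whitespace tokenizer are stand-ins for the real data and MeCab):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# toy stand-ins: pre-tokenized "tweets" and their category labels (1-3)
texts = ["anime game tv", "anime manga", "yoga skin care",
         "skin wig shape", "home storage", "housing reform storage"]
labels = [1, 1, 2, 2, 3, 3]

vect = TfidfVectorizer(analyzer=str.split)  # whitespace split instead of MeCab
X = vect.fit_transform(texts)

# small MLP; the lbfgs solver converges quickly on tiny datasets
clf = MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs",
                    max_iter=1000, random_state=0)
clf.fit(X, labels)

pred = clf.predict(vect.transform(["manga and game"]))
```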