Bag-of-Words

A Bag-of-Words vector represents a document by assigning each word to one column of a vector and using a value such as that word's occurrence count as the element.
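To make the definition concrete, here is a minimal sketch in plain Python. The helper name `bag_of_words` and the two toy sentences are made up for illustration, and tokenization is simply lowercasing and splitting on whitespace, which is a simplification of what a real tokenizer does.

```python
from collections import Counter


def bag_of_words(docs):
    """Hypothetical helper: build a word-to-column index and count vectors."""
    # Assumption: lowercase + whitespace split is enough for this toy input.
    tokenized = [doc.lower().split() for doc in docs]
    # Assign each distinct word to one column, in alphabetical order.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: col for col, word in enumerate(vocab)}
    # Each document becomes a vector of occurrence counts, one per column.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return index, vectors


index, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(index)    # word -> column index
print(vectors)  # one count vector per document
```

The dictionary only says which column a word lives in; the counts are in the vectors.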

scikit-learn's CountVectorizer
Let's analyze some remarks by President Trump.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two documents to vectorize
docs = np.array([
    'If you want freedom, take pride in your country. If you want democracy, hold on to your sovereignty. If you want peace, love your Nation',
    'President Donald J. Trump has shown that the path to prosperity and strength lies in lifting up our people and respecting our sovereignty'
])

count = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix
bag = count.fit_transform(docs)
# vocabulary_ maps each word to its column index in that matrix
print(count.vocabulary_)

[vagrant@localhost python]$ python app.py
{'if': 7, 'you': 32, 'want': 31, 'freedom': 4, 'take': 25, 'pride': 19, 'in': 8, 'your': 33, 'country': 1, 'democracy': 2, 'hold': 6, 'on': 13, 'to': 28, 'sovereignty': 23, 'peace': 16, 'love': 11, 'nation': 12, 'president': 18, 'donald': 3, 'trump': 29, 'has': 5, 'shown': 22, 'that': 26, 'the': 27, 'path': 15, 'prosperity': 20, 'and': 0, 'strength': 24, 'lies': 9, 'lifting': 10, 'up': 30, 'our': 14, 'people': 17, 'respecting': 21}

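At first glance, 'you': 32 might suggest that a US president uses the word "you" a great deal, but the values in vocabulary_ are column indices assigned in alphabetical order, not occurrence counts. The counts themselves are stored in bag.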
Seen as a word vector, this gives a rather different impression from the frequent-word lists you get with MeCab.
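Since vocabulary_ holds column indices, the actual occurrence counts have to be read from the matrix returned by fit_transform. The following is a small sketch of one way to do that, reusing the same two documents; the variable names `counts`, `col`, and `you_counts` are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'If you want freedom, take pride in your country. If you want democracy, hold on to your sovereignty. If you want peace, love your Nation',
    'President Donald J. Trump has shown that the path to prosperity and strength lies in lifting up our people and respecting our sovereignty',
]

count = CountVectorizer()
bag = count.fit_transform(docs)   # sparse matrix: one row per document

counts = bag.toarray()            # dense array of occurrence counts
col = count.vocabulary_['you']    # column index assigned to 'you'
you_counts = counts[:, col]       # occurrences of 'you' in each document

print(bag.shape)
print(col, you_counts)
```

Indexing the dense array by the column taken from vocabulary_ gives the per-document count for a word, which is the number to look at when asking how often "you" was actually used.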