[NLTK] Word frequency

$ pip3 install nltk

### download
NLTK can be download resouces
– names, stopwords, state_union, twitter_samples, moview_review, averaged_perceptron_tagger, vader_lexicon, punkt

import nltk

nltk.download([
	"names",
	"stopwords",
	"state_union",
	"twitter_samples",
	"movie_reviews",
	"averaged_perceptron_tagger",
	"vader_lexicon",
	"punkt",
])

State of union corpus

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

to use stop words

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
stopwords = nltk.corpus.stopwords.words("english")
words = [w for w in words if w.lower() not in stopwords]

word_tokenize()

text = """
For some quick analysis, creating a corpus could be overkill.
If all you need is a word list,
there are simpler ways to achieve that goal.
"""
pprint(nltk.word_tokenize(text), width=79, compact=True)

most common

fd = nltk.FreqDist(words)
pprint(fd.most_common(3))

$ python3 app.py
[(‘must’, 1568), (‘people’, 1291), (‘world’, 1128)]

specific word

fd = nltk.FreqDist(words)
pprint(fd["America"])

$ python3 app.py
1076

### concordance
どこに出現するかを示す

text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)

$ python3 app.py
Displaying 5 of 1079 matches:
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to

text = nltk.Text(nltk.corpus.state_union.words())
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
	print(entry.line)

$ python3 app.py
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace

other frequency distribution

words: list[str] = nltk.word_tokenize(
"""Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.""")
text = nltk.Text(words)
fd = text.vocab()
fd.tabulate(3)

collocation

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

pprint(finder.ngram_fd.most_common(2))
pprint(finder.ngram_fd.tabulate(2))

$ python3 app.py
[((‘the’, ‘United’, ‘States’), 294), ((‘the’, ‘American’, ‘people’), 185)]
(‘the’, ‘United’, ‘States’) (‘the’, ‘American’, ‘people’)
294 185

nltkが強力なのはわかった。