$ pip3 install nltk
### download
NLTK can download the resources it needs:
– names, stopwords, state_union, twitter_samples, movie_reviews, averaged_perceptron_tagger, vader_lexicon, punkt
import nltk

nltk.download([
    "names",
    "stopwords",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt",
])
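nltk.download() also accepts a single identifier as a string, if you only need one resource:
import nltk

# fetch just the Punkt tokenizer models
nltk.download("punkt")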
State of the Union corpus
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
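Each address is stored as its own file, so you can also inspect the corpus per document. A small sketch:
# list the individual documents in the corpus
fileids = nltk.corpus.state_union.fileids()
print(len(fileids), fileids[:2])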
To remove stop words:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
stopwords = nltk.corpus.stopwords.words("english")
words = [w for w in words if w.lower() not in stopwords]
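stopwords.words() returns a plain list; converting it to a set keeps the membership test fast on larger corpora. A minor sketch with the same result:
stopwords = set(nltk.corpus.stopwords.words("english"))
words = [w for w in words if w.lower() not in stopwords]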
word_tokenize()
text = """
For some quick analysis, creating a corpus could be overkill.
If all you need is a word list,
there are simpler ways to achieve that goal.
"""
pprint(nltk.word_tokenize(text), width=79, compact=True)
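word_tokenize() has a sibling, sent_tokenize(), which splits raw text into sentences first; a minimal sketch on the same text:
# split into sentences, then tokenize each sentence separately
for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))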
most common
fd = nltk.FreqDist(words)
pprint(fd.most_common(3))
$ python3 app.py
[('must', 1568), ('people', 1291), ('world', 1128)]
specific word
fd = nltk.FreqDist(words)
pprint(fd["America"])
$ python3 app.py
1076
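FreqDist lookups are case-sensitive, so "America" and "AMERICA" are counted separately. A sketch that normalizes case first (the count will therefore include every casing of the word, not just the 1076 above):
lower_fd = nltk.FreqDist(w.lower() for w in words)
pprint(lower_fd["america"])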
### concordance
Shows where a word appears in the text, with its surrounding context.
text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)
$ python3 app.py
Displaying 5 of 1079 matches:
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to
text = nltk.Text(nltk.corpus.state_union.words())
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
    print(entry.line)
$ python3 app.py
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
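Each entry is a ConcordanceLine namedtuple, which also seems to carry the matched word and its position; the field names below are an assumption based on NLTK's source, so treat this as a sketch:
for entry in concordance_list:
    # assumed fields: .offset (token index) and .query (the matched word)
    print(entry.offset, entry.query)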
another frequency distribution
words: list[str] = nltk.word_tokenize(
"""Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.""")
text = nltk.Text(words)
fd = text.vocab()
fd.tabulate(3)
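vocab() just returns a FreqDist, so everything from the earlier sections (most_common(), indexing) works here too:
pprint(fd.most_common(2))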
collocation
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
pprint(finder.ngram_fd.most_common(2))
finder.ngram_fd.tabulate(2)  # tabulate() prints the table itself and returns None
$ python3 app.py
[(('the', 'United', 'States'), 294), (('the', 'American', 'people'), 185)]
('the', 'United', 'States') ('the', 'American', 'people')
                        294                          185
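Beyond raw counts, the finder can rank n-grams with association measures; a sketch using TrigramAssocMeasures (likelihood ratio here, but pmi works the same way), continuing from the finder above:
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# drop rare trigrams, then take the 2 highest-scoring ones
finder.apply_freq_filter(20)
pprint(finder.nbest(trigram_measures.likelihood_ratio, 2))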
Now I can see how powerful NLTK is.