$ pip3 install nltk
### download
NLTK can download resources:
– names, stopwords, state_union, twitter_samples, movie_reviews, averaged_perceptron_tagger, vader_lexicon, punkt
import nltk

nltk.download([
    "names",
    "stopwords",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt",
])
State of the Union corpus
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
To remove stop words:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
stopwords = nltk.corpus.stopwords.words("english")
words = [w for w in words if w.lower() not in stopwords]
word_tokenize()
text = """ For some quick analysis, creating a corpus could be overkill. If all you need is a word list, there are simpler ways to achieve that goal. """ pprint(nltk.word_tokenize(text), width=79, compact=True)
most common words
fd = nltk.FreqDist(words)
pprint(fd.most_common(3))
$ python3 app.py
[('must', 1568), ('people', 1291), ('world', 1128)]
specific word
fd = nltk.FreqDist(words) pprint(fd["America"])
$ python3 app.py
1076
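Note that FreqDist lookups are case-sensitive, so looking up "america" in lowercase would not match the capitalized occurrences counted above. A minimal sketch of a case-insensitive count, lowercasing the words before building the distribution (lower_fd is just an illustrative name):

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
# "America", "AMERICA", and "america" all add to the same key.
lower_fd = nltk.FreqDist(w.lower() for w in words)
print(lower_fd["america"])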
### concordance
Shows where a word appears in the text.
text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)
$ python3 app.py
Displaying 5 of 1079 matches:
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to
text = nltk.Text(nltk.corpus.state_union.words())
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
    print(entry.line)
$ python3 app.py
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
another way to build a frequency distribution
words: list[str] = nltk.word_tokenize(
    """Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex."""
)
text = nltk.Text(words)
fd = text.vocab()
fd.tabulate(3)
collocations
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
pprint(finder.ngram_fd.most_common(2))
# tabulate() prints the table itself and returns None, so don't wrap it in pprint()
finder.ngram_fd.tabulate(2)
$ python3 app.py
[(('the', 'United', 'States'), 294), (('the', 'American', 'people'), 185)]
  ('the', 'United', 'States') ('the', 'American', 'people')
                          294                           185
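The same idea carries over to bigrams. A minimal sketch, reusing the cleaned state_union word list from above; the PMI ranking via BigramAssocMeasures and the frequency filter of 20 are illustrative choices, not something required by the steps above:

bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(words)
pprint(bigram_finder.ngram_fd.most_common(2))

# Rank pairs by pointwise mutual information instead of raw counts,
# ignoring pairs that occur fewer than 20 times.
measures = nltk.collocations.BigramAssocMeasures()
bigram_finder.apply_freq_filter(20)
pprint(bigram_finder.nbest(measures.pmi, 2))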
Now I can see how powerful NLTK is.