NLTK (Natural Language Toolkit) is an open source Python library for natural language processing. In this simple tutorial, using Python 3, you will learn how to do the following tasks in Natural Language Processing (NLP):
Let us start!
To use nltk, simply import it.
import nltk
If this is the first time you run nltk, you may want to run:
nltk.download()
It will open a window where you can download the additional resources of nltk.
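If you prefer not to use the download window, you can fetch just the resources this tutorial relies on (a minimal sketch; the tokenizers below need the punkt models and the stop word step needs the stopwords corpus):
nltk.download('punkt')      # models used by sent_tokenize and word_tokenize
nltk.download('stopwords')  # word lists used by nltk.corpus.stopwords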
You can get data from any file or from any web page. For the text example in this tutorial, I will use the Psalm of David from the Gutenberg Project. For this purpose, we will use urllib and BeautifulSoup.
import urllib.request as urllib2
from bs4 import BeautifulSoup
url='http://www.gutenberg.org/cache/epub/13166/pg13166.txt' # Psalm of David
page = urllib2.urlopen(url).read().decode('utf8')
soup = BeautifulSoup(page,"lxml")
myText=soup.get_text()
myText[:100]
We shall focus only on the original Psalm of David. Thus, we ignore the first and last parts of the book to strip out the commentary and the copyright notice.
Then, save this Psalm into a text file.
myText=myText[21514:364084] # only the main content of the Psalm of David
f = open('myText.txt','w')
f.write(myText)
f.close()
What we will use is the clean version of the Psalm from the text file that we just saved.
f = open('myText.txt','r')
myText=f.read()
f.close()
myText[:250] # show only the first 250 characters
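As a side note, the with statement closes the file automatically, so the explicit close() call is not needed; a minimal equivalent sketch:
with open('myText.txt', 'r') as f:
    myText = f.read()  # the file is closed automatically when the block ends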
Tokenization is splitting the text into sentences or words.
To tokenize the text into sentences, use sent_tokenize().
sentences = nltk.sent_tokenize(myText)
sentences[:3]
To separate the text or sentences into words including punctuation, we use word_tokenize as our tokenizer. Tokenization with punctuation may be useful for text synthesis or text generation.
words = nltk.word_tokenize(myText)
words[:10]
To separate a sentence into words without punctuation, we use RegexpTokenizer(r'\w+') as our tokenizer. Tokenization without punctuation is useful for text analysis.
tokenizer=nltk.tokenize.RegexpTokenizer(r'\w+')
words =tokenizer.tokenize(myText)
words[:10]
Stop words are common, frequently used words that convey no additional meaning to the context. Therefore, for analysis, we want to remove them so that only the words that carry the context remain.
stopWords=sorted(nltk.corpus.stopwords.words('english'))
stopWords[95:105]
To remove the stop words from the list of words, we can use a Python list comprehension. Note that the stop word list is in lower case, so we lowercase each token before comparing it.
words = [t.lower() for t in words if t.lower() not in stopWords]
words[:10]
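To verify the cleaning, we can check that a common stop word no longer appears in the list (a quick sketch):
'the' in words  # expected to be False once stop words are removed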
Now let us put the pieces of code we have learned above about tokenization into a function with more options.
def tokenizeText(text, includePunctuation=True,
                 includeStopWords=False, isLowerCase=True,
                 isRemoveNumbers=False):
    '''
    Given text, return a list of tokens (words or punctuation)
    Options:
    includePunctuation = True (default) if the bag of words includes punctuation as tokens
                       = False if the bag of words excludes punctuation.
    includeStopWords   = True if stop words are not cleaned from the bag of words
                       = False (default) to return clean words without stop words.
    isLowerCase        = True (default) if all words are transformed into lower case
                       = False if no transformation of case
    isRemoveNumbers    = True to strip all numbers from the text
                       = False (default) if no numbers are to be stripped off the text
    '''
    if isRemoveNumbers:
        import re
        text = re.sub(r"\d+", " ", text)  # replace digit sequences with a space
    if includePunctuation:
        # include punctuation as part of the tokens
        tokens = [word for sentence in nltk.sent_tokenize(text)
                  for word in nltk.word_tokenize(sentence)]
    else:
        # remove punctuation, words only
        tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
        tokens = [word for sentence in nltk.sent_tokenize(text)
                  for word in tokenizer.tokenize(sentence)]
    if isLowerCase:
        tokens = [word.lower() for word in tokens]
    if not includeStopWords:
        stopWords = set(nltk.corpus.stopwords.words('english'))  # load stop words
        tokens = [t for t in tokens if t not in stopWords]       # remove stop words
    return tokens
# sample command
words=tokenizeText(myText, includePunctuation=True, \
includeStopWords=False, isLowerCase=True, \
isRemoveNumbers=False)
words[:10]
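As a quick sanity check, we can compare how the options change the number of tokens (a minimal sketch using default values for the remaining options):
withPunct = tokenizeText(myText, includePunctuation=True)
wordsOnly = tokenizeText(myText, includePunctuation=False)
len(withPunct), len(wordsOnly)  # tokens with punctuation vs. words only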
Bag of Words is a table of words and the count of each word in the given text.
To create a bag of words, we first do tokenization and then count the frequency of each word in the given text.
The following code counts each word. The keys of the dictionary are the tokenized words; the values are the counts of each word.
# put each word and its count into a dictionary
freqDist = {}
for t in sorted(set(words)):
    freqDist[t] = words.count(t)
dict(list(freqDist.items())[3650:3660])  # show a few entries of the dictionary
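The loop above calls words.count() once per distinct word, which is slow for long texts. The standard library offers collections.Counter, which builds the same word-to-count mapping in a single pass (a minimal sketch):
from collections import Counter
freqDist = Counter(words)          # dict-like mapping: word -> count
dict(list(freqDist.items())[:10])  # show a few entries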
The following function bagOfWords returns a dictionary of the words in the text and how often each word is used in the text.
def bagOfWords(tokens):
    '''
    Given a list of tokens, return a dictionary where
    keys   = words in the text
    values = count of each word in the text
    '''
    # put each word and its count into a dictionary
    freqDist = {}
    for t in sorted(set(tokens)):
        freqDist[t] = tokens.count(t)
    return freqDist
# to use the function
words=tokenizeText(myText, includePunctuation=False, \
includeStopWords=False, isLowerCase=True, \
isRemoveNumbers=True)
bags=bagOfWords(words)
print({k: bags[k] for k in list(bags.keys())[2500:2550]},' of ', len(bags), ' words')
It would be interesting to see the most frequent words in the text.
sortedBags = sorted(bags.items(), key=lambda x: x[1], reverse=True)
sortedBags[:10]
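NLTK has a built-in class for the same purpose: nltk.FreqDist counts the tokens and can report the most common ones directly (a minimal sketch):
freq = nltk.FreqDist(words)
freq.most_common(10)  # the ten most frequent words and their counts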
Sometimes two or three words often appear together in a sentence. We would like to know how often those consecutive words are used in the text.
A bigram is a pair of two consecutive words, a trigram is a sequence of three consecutive words, and an n-gram is a sequence of N consecutive words.
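As a tiny illustration (the short word list here is just made up for the example), nltk.bigrams pairs up neighboring tokens:
list(nltk.bigrams(['praise', 'ye', 'the', 'lord']))
# [('praise', 'ye'), ('ye', 'the'), ('the', 'lord')]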
NLTK has a function for bigrams.
bigram = list(nltk.bigrams(words))
bigram[:10]
Using a similar procedure as above, we can also count the frequency of each bigram, sort them, and get the most frequently used bigrams.
bigramCount = {}
for t in sorted(set(bigram)):
    bigramCount[t] = bigram.count(t)
sortedBigram = sorted(bigramCount.items(), key=lambda x: x[1], reverse=True)
sortedBigram[:10]
NLTK has a function for trigrams. Let us use this function to get the most often used trigrams.
trigram = list(nltk.trigrams(words))
trigramCount = {}
for t in sorted(set(trigram)):
    trigramCount[t] = trigram.count(t)
sortedTrigram = sorted(trigramCount.items(), key=lambda x: x[1], reverse=True)
sortedTrigram[:3]
NLTK also has n-grams, where we can specify the number of words that go together.
n = 4
nGrams = list(nltk.ngrams(words, n))
nGrams[:3]
Let us generalize our lesson about n-grams with the following function.
def oftenUsedNGram(tokens, N):
    nGrams = list(nltk.ngrams(tokens, N))
    nGramCount = {}
    for t in sorted(set(nGrams)):
        nGramCount[t] = nGrams.count(t)
    sortedNgram = sorted(nGramCount.items(), key=lambda x: x[1], reverse=True)
    return sortedNgram
sevenGram=oftenUsedNGram(words,7)
sevenGram[:3]
We analyze the n-grams further to check which words are at the top of each n-gram from n=1 to 10. You will also see how the frequency decreases as the number of words increases.
nGram = []
for n in range(1, 11):
    nGram.append(oftenUsedNGram(words, n)[0])  # the most frequent n-gram for each n
nGram
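To print the result more readably, a small sketch that lists each n with its most frequent n-gram and count:
for n, (gram, count) in enumerate(nGram, start=1):
    print(n, gram, count)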
last update: Feb 2018
Cite this tutorial as:
Teknomo,K. (2018) Learning Natural Language Processing with Python NLTK: Analyzing the book of Psalm of David accessed from (http://people.revoledu.com/kardi/tutorial/Python/)
See Also: Python for Data Science
Visit www.Revoledu.com for more tutorials in Data Science
Copyright © 2018 Kardi Teknomo
Permission is granted to share this notebook as long as the copyright notice is intact.