Learning Natural Language Processing with Python NLTK:
Analyzing the book of Psalm of David

by Kardi Teknomo

NLTK is an open-source natural language toolkit for Python. In this simple tutorial, I am using Python 3, and you will learn how to do the following tasks in Natural Language Processing (NLP):

  • Tokenization
  • Stop Words
  • Bag of Words
  • N-Grams (bigram, trigram, n-gram)

Let us start!

To use nltk, simply import it.

import nltk

If this is the first time you are running NLTK, you may want to run:

nltk.download()

It will open a window from which you can download the additional NLTK resources (corpora and models).
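
Alternatively, instead of the interactive window, you can download just the resources this tutorial needs: the punkt tokenizer models (used by sent_tokenize and word_tokenize) and the stopwords corpus.

import nltk
nltk.download('punkt')       # tokenizer models used by sent_tokenize and word_tokenize
nltk.download('stopwords')   # the stop word list used later in this tutorial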

In [1]:
import nltk

Text Data

You can get data from any file or any web page. For the text example in this tutorial, I will use The Psalms of David from the Project Gutenberg. For this purpose, we will use urllib and BeautifulSoup.

In [2]:
import urllib.request as urllib2
from bs4 import BeautifulSoup
url='http://www.gutenberg.org/cache/epub/13166/pg13166.txt' # Psalm of David
page = urllib2.urlopen(url).read().decode('utf8')
soup = BeautifulSoup(page,"lxml")
myText=soup.get_text()
myText[:100]
Out[2]:
'The Project Gutenberg EBook of The Psalms of David, by Isaac Watts\r\n\r\nThis eBook is for the use of a'

We shall focus only on the psalms themselves. Thus, we drop the first and last parts of the file to clean up the commentary and the copyright notice.

Then, we save the psalms into a text file.

In [3]:
myText=myText[21514:364084] # only the main content of the Psalm of David

f = open('myText.txt','w',encoding='utf8')
f.write(myText)
f.close()
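
The character offsets above were found by inspecting this particular file by hand. A rougher but more robust alternative is to search for the standard Project Gutenberg marker lines in the full downloaded text (the value of myText before the slice above); a minimal sketch, assuming the usual '*** START OF ...' and '*** END OF ...' markers appear in this file (some of the front matter would still need trimming):

fullText = soup.get_text()             # the full downloaded text, before slicing
start = fullText.find('*** START OF')  # position of the Gutenberg header marker (assumed present)
end = fullText.find('*** END OF')      # position of the Gutenberg footer marker (assumed present)
if start != -1 and end != -1:
    mainText = fullText[start:end]     # the text between the two markers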

From here on, we will use the clean version of the psalms from the text file that we just saved.

In [4]:
f = open('myText.txt','r',encoding='utf8')
myText=f.read()
f.close()
myText[:250] # show only the first 250 characters
Out[4]:
"THE Psalms of David,\n\nIn Metre.\n\n\n\n\n\nPsalm 1:1. Common Metre,\n\nThe way and end of the righteous and the wicked.\n\n\n\n1 Blest is the man who shuns the place\n\nWhere sinners love to meet;\n\nWho fears to tread their wicked ways,\n\nAnd hates the scoffer's sea"

Tokenization

Tokenization is the process of splitting text into sentences and words.

To tokenize the text into sentences, use sent_tokenize().

In [5]:
sentences = nltk.sent_tokenize(myText)
sentences[:3]
Out[5]:
['THE Psalms of David,\n\nIn Metre.',
 'Psalm 1:1.',
 'Common Metre,\n\nThe way and end of the righteous and the wicked.']

To split the text or sentences into words while keeping punctuation marks as separate tokens, we use word_tokenize as our tokenizer. Tokenization with punctuation may be useful for text synthesis or text generation.

In [6]:
words = nltk.word_tokenize(myText)
words[:10]
Out[6]:
['THE', 'Psalms', 'of', 'David', ',', 'In', 'Metre', '.', 'Psalm', '1:1']

To split a sentence into words without punctuation, we use RegexpTokenizer(r'\w+') as our tokenizer. Tokenization without punctuation is useful for text analysis.

In [7]:
tokenizer=nltk.tokenize.RegexpTokenizer(r'\w+')
words =tokenizer.tokenize(myText)
words[:10]
Out[7]:
['THE', 'Psalms', 'of', 'David', 'In', 'Metre', 'Psalm', '1', '1', 'Common']

Stop Words

Stop words are common, frequently used words that carry little additional meaning in context. Therefore, for analysis, we want to remove them so that only the content words remain.

In [8]:
stopWords=sorted(nltk.corpus.stopwords.words('english'))
stopWords[95:105]
Out[8]:
['not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other']

To remove the stop words from the list of words, we can use a Python list comprehension. Note that the NLTK stop word list is in lower case, so we lower-case each token before comparing it against the list.

In [9]:
words = [t.lower() for t in words if t.lower() not in stopWords]
words[:10]
Out[9]:
['psalms', 'david', 'metre', 'psalm', '1', '1', 'common', 'metre', 'way', 'end']

Now let us put together the pieces of code we have learned above about tokenization into a single function with more options.

In [10]:
def tokenizeText(text,includePunctuation=True, \
               includeStopWords=False,isLowerCase=True, \
               isRemoveNumbers=False):
    '''
    Given text, return a list of tokens (words or punctuation)
    
    Options:
        includePunctuation = True (default) if the bag of words include punctuation as token
                           = False if the bag of words exclude punctuation.
        includeStopWords = True if stop words are not cleaned from bag of words
                         = False (default) to return clean words without stop words.
        isLowerCase = True (default) if all words are transformed into lower case
                    = False if no transformation of case
        isRemoveNumbers = True to strip all numbers from the text
                        = False (default) if no numbers to be stripped off the text
    '''
    if isRemoveNumbers==True:
        import re
        text = re.sub(r"\d+", " ", text)
        
    if includePunctuation==True:
        # include punctuation as part of token or word
        tokens = [word for token in nltk.sent_tokenize(text) 
                  for word in nltk.word_tokenize(token)]
    else:
        # remove punctuation, words only
        tokenizer=nltk.tokenize.RegexpTokenizer(r'\w+')
        tokens = [word for token in nltk.sent_tokenize(text) 
                  for word in tokenizer.tokenize(token)]
        
    if isLowerCase==True:
        tokens=[word.lower() for word in tokens]
        
    if includeStopWords==False:
        stopWords=set(nltk.corpus.stopwords.words('english'))  # load stop words
        tokens = [t for t in tokens if t not in stopWords]     # cleaning word from stop words 
    
    return tokens


# sample command
words=tokenizeText(myText, includePunctuation=True, \
               includeStopWords=False, isLowerCase=True, \
               isRemoveNumbers=False)
words[:10]
Out[10]:
['psalms', 'david', ',', 'metre', '.', 'psalm', '1:1', '.', 'common', 'metre']

Bag of Words

A bag of words is a table of the words in a given text together with the count of each word.

To create a bag of words, we first tokenize the text and then count the frequency of each word.

The following code counts each word. The keys of the dictionary are the tokenized words; the values are the counts of each word.

In [11]:
# put each word and the count into a dictionary
freqDist={}
for t in sorted(set(words)):
    freqDist[t]=words.count(t)
    
dict(list(freqDist.items())[3650:3660])    # show a few contents of the dictionary
Out[11]:
{'shun': 6,
 'shuns': 2,
 'sick': 1,
 'sick-bed': 1,
 'sickly': 2,
 'sickness': 8,
 'sicknesses': 1,
 'side': 17,
 'sigh': 7,
 'sighs': 4}
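
Calling words.count(t) inside the loop above rescans the whole word list once for every distinct word. For larger texts, Python's built-in collections.Counter builds the same word-to-count mapping in a single pass; a minimal equivalent sketch (not part of the original code):

from collections import Counter

wordCounts = Counter(words)   # same word -> count mapping as freqDist above, built in one pass
wordCounts['sickness']        # e.g., look up the count of a single word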

The following function, bagOfWords, returns a dictionary of the words in the text and how often each word is used in the text.

In [12]:
def bagOfWords(tokens):
    '''
    Given list of tokens, return a dictionary where 
         keys  = words in the text
         value = count of the words in the text    
    '''    
    # put each word and the count into a dictionary
    freqDist={}
    for t in sorted(set(tokens)):
        freqDist[t]=tokens.count(t)
        
    return freqDist


# to use the function
words=tokenizeText(myText, includePunctuation=False, \
               includeStopWords=False, isLowerCase=True, \
               isRemoveNumbers=True)
bags=bagOfWords(words)
print({k: bags[k] for k in list(bags.keys())[2500:2550]},' of ', len(bags), ' words')
{'private': 3, 'privilege': 1, 'prize': 1, 'proceed': 2, 'proceeds': 5, 'proclaim': 35, 'proclaims': 4, 'procur': 1, 'procure': 1, 'procures': 1, 'produce': 1, 'product': 1, 'profan': 1, 'profane': 7, 'profanely': 3, 'profess': 2, 'professed': 1, 'profession': 2, 'profit': 4, 'profound': 1, 'projects': 1, 'prolong': 1, 'prolongs': 1, 'promis': 18, 'promise': 26, 'promises': 14, 'promotion': 1, 'pronounc': 2, 'pronounce': 13, 'proof': 1, 'prop': 2, 'proper': 5, 'prophet': 3, 'prophetic': 2, 'prophets': 1, 'proportion': 1, 'prospect': 1, 'prosperity': 3, 'prosperous': 1, 'protect': 1, 'protection': 3, 'protects': 2, 'proud': 25, 'proudest': 3, 'prov': 2, 'prove': 25, 'proved': 1, 'proverb': 1, 'proves': 3, 'provide': 2}  of  3903  words

It would be interesting to see the most frequent words in the text.

In [13]:
sortedBags = sorted(bags.items(), key=lambda x: x[1], reverse=True)
sortedBags[:10]
Out[13]:
[('thy', 1288),
 ('god', 702),
 ('shall', 592),
 ('lord', 517),
 ('psalm', 379),
 ('grace', 293),
 ('let', 250),
 ('love', 215),
 ('praise', 214),
 ('part', 196)]
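
NLTK also has its own frequency-distribution class, nltk.FreqDist, which wraps this count-and-sort pattern; a short sketch of the equivalent call:

freq = nltk.FreqDist(words)   # counts every token in the list
freq.most_common(10)          # the ten most frequent words, like sortedBags[:10]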

N-gram

Sometimes two or three words tend to appear together in a sentence. We would like to know how often such sequences of consecutive words are used in the text.

A bigram is a sequence of two consecutive words, a trigram is a sequence of three consecutive words, and an n-gram is a sequence of N consecutive words.

Bigram

NLTK has a function for bigrams.

In [14]:
bigram = list(nltk.bigrams(words))
bigram[:10]
Out[14]:
[('psalms', 'david'),
 ('david', 'metre'),
 ('metre', 'psalm'),
 ('psalm', 'common'),
 ('common', 'metre'),
 ('metre', 'way'),
 ('way', 'end'),
 ('end', 'righteous'),
 ('righteous', 'wicked'),
 ('wicked', 'blest')]

Using a similar procedure as above, we can also count the frequency of each bigram, sort them, and get the most frequently used bigrams.

In [15]:
bigramCount={}
for t in sorted(set(bigram)):
    bigramCount[t]=bigram.count(t)
sortedBigram = sorted(bigramCount.items(), key=lambda x: x[1], reverse=True)
sortedBigram[:10]
Out[15]:
[(('thy', 'word'), 75),
 (('first', 'part'), 69),
 (('psalm', 'first'), 69),
 (('second', 'part'), 63),
 (('thy', 'grace'), 63),
 (('psalm', 'second'), 56),
 (('thy', 'name'), 48),
 (('psalm', 'c'), 47),
 (('ne', 'er'), 39),
 (('thy', 'throne'), 37)]

Trigram

NLTK also has a function for trigrams. Let us use it to get the most frequently used trigrams.

In [16]:
trigram = list(nltk.trigrams(words))
trigramCount={}
for t in sorted(set(trigram)):
    trigramCount[t]=trigram.count(t)
sortedTrigram = sorted(trigramCount.items(), key=lambda x: x[1], reverse=True)
sortedTrigram[:3]
Out[16]:
[(('psalm', 'first', 'part'), 68),
 (('psalm', 'second', 'part'), 55),
 (('psalm', 'third', 'part'), 19)]

N-grams

NLTK also has an ngrams function where we can specify the number of consecutive words.

In [17]:
n = 4
nGrams = list(nltk.ngrams(words, n))
nGrams[:3]
Out[17]:
[('psalms', 'david', 'metre', 'psalm'),
 ('david', 'metre', 'psalm', 'common'),
 ('metre', 'psalm', 'common', 'metre')]

Let us generalize our lesson about n-grams with the following function.

In [18]:
def oftenUsedNGram(tokens,N):
    nGrams = list(nltk.ngrams(tokens, N))
    nGramCount={}
    for t in sorted(set(nGrams)):
        nGramCount[t]=nGrams.count(t)
    sortedNgram = sorted(nGramCount.items(), key=lambda x: x[1], reverse=True)
    return sortedNgram

sevenGram=oftenUsedNGram(words,7)
sevenGram[:3]
Out[18]:
[(('lord', 'shall', 'still', 'endure', 'ever', 'sure', 'abides'), 5),
 (('mercy', 'lord', 'shall', 'still', 'endure', 'ever', 'sure'), 5),
 (('power', 'grace', 'still', 'let', 'name', 'endless', 'praise'), 5)]
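
As in the bag-of-words section, the count loop inside oftenUsedNGram rescans the n-gram list once per distinct n-gram. A one-pass equivalent using collections.Counter is sketched below (the name oftenUsedNGramFast is hypothetical, not from the original code):

from collections import Counter

def oftenUsedNGramFast(tokens, N):
    # same output shape: a list of (ngram, count) pairs sorted by descending count
    return Counter(nltk.ngrams(tokens, N)).most_common()

oftenUsedNGramFast(words, 7)[:3]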

We can analyze the n-grams further by looking at one of the top n-grams for each n from 1 to 10 (the code below picks the second most frequent n-gram for each n; index 0 would give the most frequent). You will also see how the frequency decreases as the number of words increases.

In [19]:
nGram=[]
for n in range(1,11):
    nGram.append(oftenUsedNGram(words,n)[1])   # [1] = second most frequent n-gram; use [0] for the most frequent
nGram
Out[19]:
[(('god',), 702),
 (('first', 'part'), 69),
 (('psalm', 'second', 'part'), 55),
 (('psalm', 'first', 'part', 'l'), 13),
 (('ever', 'sure', 'abides', 'thy', 'word'), 5),
 (('grace', 'still', 'let', 'name', 'endless', 'praise'), 5),
 (('mercy', 'lord', 'shall', 'still', 'endure', 'ever', 'sure'), 5),
 (('mercy', 'lord', 'shall', 'still', 'endure', 'ever', 'sure', 'abides'), 5),
 (('mercy',
   'lord',
   'shall',
   'still',
   'endure',
   'ever',
   'sure',
   'abides',
   'thy'),
  5),
 (('thy',
   'mercy',
   'lord',
   'shall',
   'still',
   'endure',
   'ever',
   'sure',
   'abides',
   'thy'),
  5)]

last update: Feb 2018

Cite this tutorial as:

Teknomo, K. (2018) Learning Natural Language Processing with Python NLTK: Analyzing the book of Psalm of David, accessed from http://people.revoledu.com/kardi/tutorial/Python/

See Also: Python for Data Science

Visit www.Revoledu.com for more tutorials in Data Science

Copyright © 2018 Kardi Teknomo

Permission is granted to share this notebook as long as the copyright notice is intact.