Using Python to Perform Lexical Analysis on a Short Story

One of the more interesting things going on in the Big Data world right now is some of the quantitative analysis being done on how people use the English language. This is widely known as Natural Language Processing—“NLP”—a field that can be thought of as a fusion between artificial intelligence and linguistics.

A useful framework for getting started in NLP is the Natural Language Toolkit—“NLTK”—a Python library full of tools and approaches for performing basic NLP. It also comes with a free guidebook, complete with example code and explanations, that can help anybody get started with only a basic knowledge of programming.

NLP is interesting to me because it opens the door for anybody with some computer skills to perform large-scale analysis on language of any kind. I’ve written about how I enjoy following the media industry because it’s so central to how people communicate with each other, and my interest in language comes from that same place. Applying data analysis to language could uncover patterns in how we use it that we hadn’t considered before.

There’s a lot of room for creativity here. An interesting application of this might be a project that collects political news articles from two different newspapers over the course of a month, and then, using various NLP statistics, compares and contrasts the two newspapers. Do the two newspapers use certain adjectives differently? What does that tell us about the “slant” of that newspaper? There are lots of interesting questions you could ponder.

I decided to pursue a small creative NLP project just for the hell of it. The idea was to perform basic NLP on a 2,955-word short story I wrote in a Creative Writing class last semester, and to just get a feel for what a basic exploratory analysis might look like. The short story is titled Ron Rockwell. I am going to include the Python code I wrote at each step with some added commentary, so you can follow along and reproduce the study if you want. To download all the code I used at once, just see the script I have on GitHub.

Before I do anything, I need to import the external libraries we’ll use for this exercise, which I do by running the following code.
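
If you’re following along, something like this covers everything we’ll use below: NLTK itself, its FreqDist class, and its list of English stopwords, which ships as a separate one-time download.

import nltk
from nltk import FreqDist
from nltk.corpus import stopwords

# The stopwords corpus is a separate download; this only needs to run once.
nltk.download('stopwords')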

I wrote my short story in Microsoft Word (download link). To convert it to a text file, I copied & pasted the story from Word into a .txt file in Notepad++, then manually deleted any strange characters that appeared in translation. Now we have an easy-to-work-with format: the entirety of Ron Rockwell in a plain text file (download link). Let’s open it with Python.
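
Reading it in takes just a couple of lines. One way to do it, assuming the file is saved as ronrockwell.txt in your working directory:

# Read the entire story into a single string.
with open('ronrockwell.txt') as f:
    raw = f.read()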

Next, I’ll write a function that further preprocesses the text. The output is an object that consists of all the words found in Ron Rockwell as word “tokens”.
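
A minimal version of that function lowercases the text and splits it on anything that isn’t a letter or digit, so punctuation disappears and a contraction like “don’t” becomes the two tokens “don” and “t”:

import re

def tokenize(text):
    # Lowercase the text, then split on runs of non-alphanumeric
    # characters; empty strings from the split are dropped.
    return [t for t in re.split(r'[^a-z0-9]+', text.lower()) if t]

tokens = tokenize(raw)
print(tokens[:20])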

['\xef', 'that', 's', 'it', 'for', 'today', 'remember', 'homework', 'is', 'due', 'next', 'class', 'and', 'don', 't', 'forget', 'we', 'have', 'a', 'test']

The output displays the first 20 of our tokens. (The stray '\xef' at the front is a leftover byte-order mark from the Word-to-text conversion, exactly the kind of strange character I mentioned earlier.)

Let’s define a short function to compute an introductory metric for our story. Lexical diversity is the ratio of unique words to the total number of words in the story.
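
A minimal version, using the tokens list from the previous step:

def lexical_diversity(tokens):
    # Unique tokens divided by total tokens.
    return len(set(tokens)) / len(tokens)

print(lexical_diversity(tokens))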

0.3222842139809649

Let’s check our math. How many tokens are there, total?
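
That’s just the length of the token list:

print(len(tokens))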

3047

How many unique tokens are there in that set?
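
Wrapping the list in set() collapses the duplicates:

print(len(set(tokens)))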

982

982/3047 is equal to a .32 Lexical Diversity score.

The FreqDist() function turns our list of tokens into a Frequency Distribution object, giving us the frequency of every token in my story.
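
Building it is a one-liner; evaluating the object at the interpreter displays the most frequent tokens, which is the output shown below:

fdist = FreqDist(tokens)
fdist   # at the interpreter, this displays the top of the distribution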

<FreqDist: 'the': 108, 'and': 105, 'to': 98, 'a': 82, 'of': 69, 'ron': 61, 'in': 56, 'i': 47, 'was': 47, 'his': 45, ...>

We can plot this frequency distribution:
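
FreqDist objects come with a built-in plot() method (it requires matplotlib). The cutoff of 50 words below is an arbitrary choice to keep the plot readable:

fdist.plot(50)   # frequency plot of the 50 most common tokens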

[figure_1: frequency distribution plot of the tokens]

Not many of those words are useful for analysis. “The”, “and”, “to”, “a”, and “of” are all used in the English language to provide basic sentence structure. We want to filter out these kinds of words and focus only on the words that actually tell us something about the story. Words like these are called “stopwords”, and we’re going to remove them by importing a pre-made list of stopwords and filtering our tokens against it.
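
One way to do the filtering, using the stopwords corpus imported earlier (exactly which tokens get removed depends on the version of NLTK’s stopword list; newer versions also catch contraction fragments like 've' and 'll'):

stop_words = set(stopwords.words('english'))

# Keep only the tokens that aren't in the stopword list.
filtered_tokens = [t for t in tokens if t not in stop_words]
filtered_tokens[:20]   # peek at the first 20 surviving tokens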

['\xef',
'today',
'remember',
'homework',
'due',
'next',
'class',
'forget',
'test',
'friday',
'everything',
've',
'learned',
'chain',
'rule',
'll',
'office',
'wednesday',
'twelve',
'two']

How many total words are left?
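
Same count as before, on the filtered list:

print(len(filtered_tokens))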

1542

How many unique words made the cut?
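
And the unique count:

print(len(set(filtered_tokens)))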

876

Let’s turn our stopword-free list of tokens into a Frequency Distribution object.
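
Exactly as before, just on the filtered list:

fdist2 = FreqDist(filtered_tokens)
fdist2   # at the interpreter, this displays the top of the distribution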


<FreqDist: 'ron': 61, 'molly': 25, 'calculus': 16, 'students': 15, 'class': 13, 'time': 13, 'mathematics': 10, 'really': 10, 'wasn': 10, 'know': 9, ...>

Now we’re getting somewhere. Let’s see a list of the top 20 most common non-stopwords in my short story.
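
The most_common() method returns exactly this, as a plain list of (word, count) tuples:

fdist2.most_common(20)   # the 20 most frequent non-stopwords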


[('ron', 61),
('molly', 25),
('calculus', 16),
('students', 15),
('class', 13),
('time', 13),
('mathematics', 10),
('really', 10),
('wasn', 10),
('know', 9),
('life', 9),
('office', 9),
('professor', 9),
('took', 9),
('college', 8),
('wanted', 8),
('felt', 7),
('get', 7),
('like', 7),
('things', 7)]

This makes a lot more sense. The two most common words are the names of the two main characters, Ron and Molly. The third most common word, “calculus”, describes how those characters are connected: Ron is the professor of Molly’s calculus class. The next four most common words, “students”, “class”, “time”, and “mathematics”, also make sense in this context.

We can also plot this frequency distribution:
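
Same plot() call, on the filtered distribution:

fdist2.plot(50)   # the 50 most common tokens, stopwords excluded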

[figure_2: frequency distribution plot of the non-stopword tokens]

OK, that’s pretty neat. Now let’s take a measure of how rich our vocabulary was with the stopwords removed.
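
Reusing the lexical_diversity() function from earlier:

print(lexical_diversity(filtered_tokens))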

0.5680933852140078

That’s higher than the .32 score we saw earlier. Removing all those common stopwords like “a” and “the” is responsible for this increase.

OK, so to save some of this data for later (in this case, the top 20 most common non-stopwords), we’ll turn that data into a list, which we’ll then use to write out a .csv file.
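
Since most_common() already hands back a plain Python list of (word, count) tuples, this step is a single line:

top_words = fdist2.most_common(20)   # list of (word, count) tuples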

Finally, we can take this data and export it as a .csv file for further analysis in R or Excel.
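
One way to write the files out with the csv module from the standard library (I’m guessing at the exact layout, but something along these lines produces the two files linked below):

import csv

# One column of the filtered tokens.
with open('tokens.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for token in filtered_tokens:
        writer.writerow([token])

# The top-20 table, with a header row for R or Excel.
with open('newfile.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'count'])
    writer.writerows(top_words)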

You can download the files here as proof that this worked:

tokens.csv
newfile.csv

So that’s that. Obviously I didn’t do anything groundbreaking, but I did set up a platform for processing language files into tokens for NLP analysis. And we discovered that, after removing the stopwords, the three most common words were the names of the story’s two main characters (“Ron” and “Molly”), and the primary characteristic they have in common (“calculus”). SPOILER ALERT: the short story is about Ron, Molly, and their relationship.

So, while I haven’t done much more than rookie-level NLP here, I think this kind of analysis can deliver some cool insights when it’s applied to an interesting dataset.
