In this tutorial, we will use Python and its plotting module matplotlib to illustrate the word frequency distributions of texts. The pattern we expect is described by Zipf's law, which states that a word's frequency is inversely proportional to its rank in the frequency table.
This means the second most frequent word appears half as often as the most common one, the third one a third as often, and so on. We will analyze texts and show these frequencies in a line graph.
To get started, let's install matplotlib, NumPy, and SciPy:
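With pip, the installation might look like this:

```
$ pip install matplotlib numpy scipy
```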
Let us start by importing some modules to help create this program. First, we get os so we can list all the files in a directory with the os.listdir() function; this lets us dump .txt files into a specific folder and have our program include them dynamically. Next, we get pyplot from matplotlib, the simple non-object-oriented API for matplotlib. After that, we get string, which we need because we will remove all the punctuation from the texts, and this module provides those characters as a ready-made string. The last two imports are done so we can later smooth out the curve:
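Based on the description above, the imports might look like this (the exact form of the last two is an assumption; any way of getting NumPy and make_interp_spline into scope works):

```python
import os
import string

from matplotlib import pyplot as plt

# These two are used later to smooth out the curves.
import numpy as np
from scipy.interpolate import make_interp_spline
```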
Next up, we set up some variables. texts will hold the texts of the files, accessible by their filename without the extension. textlengths is pretty similar; it will just hold the length of each text. textwordamounts will hold, for each text, another dictionary where each key is a word and the value is its number of occurrences in that text. After that, we get the punctuation string from the string module and store it as a list of characters:
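A sketch of this setup, with names taken from the description (the name of the punctuation list is my own choice):

```python
texts = {}             # filename (without extension) -> text content
textlengths = {}       # filename -> number of words
textwordamounts = {}   # filename -> {word: occurrences}

# The punctuation characters we will strip from the texts.
unwantedCharacters = list(string.punctuation)
```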
After that, we define how deep we want to go with the occurrence check, and we define the x-axis, because it will always be the same: ranging from 1 to 10 (or whatever we specify as depth):
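For example (both names are assumptions):

```python
depth = 10
xAxis = [str(i) for i in range(1, depth + 1)]  # '1' to '10', used later as tick labels
```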
After the setup, we can finally start by getting the texts. For this, we first need the filenames of all files in the texts folder (you can name this folder whatever you want). Then we loop over these names, open each file, and read its content into the texts dictionary. We set the key to path.split('.')[0] so we get the file name without the extension:
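Something along these lines, assuming the folder is named texts:

```python
# Read every file in the texts folder into the texts dictionary,
# keyed by the filename without its extension.
for path in os.listdir('texts'):
    with open(os.path.join('texts', path), 'r', encoding='UTF-8') as f:
        texts[path.split('.')[0]] = f.read()
```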
Now we continue by cleaning the texts of unwanted characters and counting the words. We loop over each key in the dictionary, and within this loop, we loop over each undesirable character, replace it in the text, and assign the cleaned string back to the specified key:
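A sketch of the cleaning loop:

```python
for key in texts:
    for character in unwantedCharacters:
        texts[key] = texts[key].replace(character, '')
```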
After that, we split the text by spaces and save the length of the resulting list to our textlengths dictionary:
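Which might look like this:

```python
for key in texts:
    textlengths[key] = len(texts[key].split(' '))
```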
Continuing, we make a new dictionary in the textwordamounts dict to store the occurrences for each word. After that, we start a loop over the words of the current text.
Then we check if the current word is already in the dictionary. If that's the case, we increment it by one; if not, we set the value to one, starting the count from here. Putting these two steps together:
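```python
for key in texts:
    # Fresh counting dictionary for this text.
    textwordamounts[key] = {}
    for word in texts[key].split(' '):
        if word in textwordamounts[key]:
            textwordamounts[key][word] += 1
        else:
            textwordamounts[key][word] = 1
```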
After that, we sort the dictionary with the sorted() function. We transform the dict into a list of two-item tuples and define a custom key that we set to the second item of each tuple. It's a mouthful, but we do this with a lambda. We only keep a limited number of items, specified by depth:
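One way to write this (note that we sort in descending order so the most common words come first):

```python
for key in textwordamounts:
    textwordamounts[key] = dict(
        sorted(textwordamounts[key].items(),
               key=lambda item: item[1],
               reverse=True)[:depth]
    )
```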
Now we make two helper functions. We start with the percentify() function. This one is pretty simple: it accepts a value and the max, and it calculates the percentage:
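A minimal version (whether the result is rounded is an assumption):

```python
def percentify(value, max_value):
    # What percentage of max_value is value?
    # (max_value instead of max, to avoid shadowing the built-in.)
    return round(value / max_value * 100)
```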
The following function is used to smooth out the curves generated by the word occurrences. We won't go in-depth here because this is purely cosmetic, but the takeaway is that we use NumPy arrays and make_interp_spline() to smooth it out. In the end, we return the new x and y axes:
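A sketch of such a function (the number of interpolation points is an arbitrary choice):

```python
def smoothify(yInput):
    x = np.array(range(0, depth))
    y = np.array(yInput)
    # Generate a denser x-axis and fit a cubic spline through the points.
    x_smooth = np.linspace(x.min(), x.max(), 500)
    spline = make_interp_spline(x, y, k=3)
    y_smooth = spline(x_smooth)
    return x_smooth, y_smooth
```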
So we can compare our texts with Zipf's law, we now also make a perfect Zipf curve. We do this with a list comprehension that gives us values like this: [100, 50, 33, 25, 20, ...]. We then use the smoothify() function on this list and pass the resulting x and y axes to the plot() function of pyplot. We also set the line style to dotted and the color to grey:
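For example (the label text is my own):

```python
# A perfect Zipf curve: 100, 50, 33.3, 25, 20, ...
zipfCurveValues = [100 / i for i in range(1, depth + 1)]
x, y = smoothify(zipfCurveValues)
plt.plot(x, y, label='Zipf Curve', c='grey', linestyle='dotted')
```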
Let us finally display the data. To do this, we loop over the textwordamounts dictionary and get the value of the first item, which will be the count of the most common word in each text.

Then we use our percentify() function on each value of the amounts dict. Of course, for the most common word this will return 100, because it is checked against itself.

After that, we pass this newly made y-axis to our smoothify() function.

Last but not least, we plot the data and give it a label that shows the text name and the word amount. We set the opacity with the alpha parameter:
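Put together, the plotting loop might look like this:

```python
for key in textwordamounts:
    # The first value is the count of the most common word.
    mostCommon = list(textwordamounts[key].values())[0]
    yAxis = [percentify(value, mostCommon)
             for value in textwordamounts[key].values()]
    x, y = smoothify(yAxis)
    plt.plot(x, y, label=f'{key} [{textlengths[key]}]', alpha=0.5)
```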
And after the plotting loop, we set the x ticks so they look correct, call the legend() function so all the labels are shown, and save the plot to disk with a high DPI. At the very end, we show the plot we just made with show():
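For instance (the output filename and the DPI value are assumptions):

```python
plt.xticks(range(0, depth), xAxis)  # label the ticks 1..10 instead of 0..9
plt.legend()
plt.savefig('wordDistribution.png', dpi=300)
plt.show()
```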
Below you see a showcase of our little program. We dumped a number of text files into the texts folder and then ran the program. As you can see, the Password Generator tutorial and the Simple Text Editor with Tkinter are pretty close to the perfect curve, while the Planet Simulation with PyGame is way off.
Excellent! You have successfully created a plot using Python code! See how you can analyze other data with this knowledge!
Learn also: How to Create Plots with Plotly In Python.
Happy coding ♥