“Okay, Jason,” you’re asking yourself, “I’m tired of saying hello and counting numbers and doing mathematics. How can Ruby be applied to my work as a humanities scholar?” I’m thrilled you asked! Because today, we’re writing our first full program together. I’ll warn you, this might be a long read and a lot of writing. But I’m hoping by doing this we experience the process of designing, planning, writing code, optimizing code, debugging, and finally using the program.
We’re going to write a program based on a homework example we completed in Prof. Steve Ramsay’s class. (To Steve’s future students: don’t copy this program. Your professor will know.) We’re going to take a word frequency generator and have it read a file off the Internet, strip the HTML or XML tagging out of the file, generate word frequencies, and print the frequencies as a new HTML file. A lot will be happening, so I hope I can carefully and concisely explain the details of our program as we go along.
One potential way to write our word frequency program is as such:
Our program takes in a file (text) and passes it to our separation method, which converts everything into a string, downcases the words for normalization, and scans for runs of word characters and apostrophes (hence the regex /[\w']+/, which keeps a word like “don’t” in one piece). Once the program has read the file and broken the text into individual words, it sends those words to our word_count method, which stores them in a hash. Inside word_count, each occurrence of a word increments that word’s count until the whole list has been processed. We return number, call the sort method with our sort values (word and count), and print our results.
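A minimal sketch of the approach described above might look like the following. The method names separation and word_count and the regex come from the description; the sample sentence, the variable names, and the tie-breaking sort are my own assumptions for illustration.

```ruby
# Split text into lowercase words. The regex matches runs of word
# characters and apostrophes, so "didn't" stays a single token.
def separation(text)
  text.to_s.downcase.scan(/[\w']+/)
end

# Tally how often each word occurs. Hash.new(0) gives unseen words a
# starting count of zero, so we can increment without checking first.
def word_count(words)
  counts = Hash.new(0)
  words.each { |word| counts[word] += 1 }
  counts
end

# Sample text (an assumption, standing in for a real file).
sample = "The cat saw the dog. The dog didn't see the cat."
counts = word_count(separation(sample))

# Sort by count, highest first (ties alphabetical), and print.
counts.sort_by { |word, count| [-count, word] }.each do |word, count|
  puts "#{word}: #{count}"
end
```

Sorting on [-count, word] is just one way to get a stable, highest-first listing; sorting on the count alone and reversing would work too.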
There are certainly several ways to achieve the results we’re after. If you have your own word frequency generator that you’re comfortable working with, go ahead and use it. I’ll be using my own code:
You should now have a working word frequency generator. However, we want to be able to read HTML files from the web; this will make the program much more useful. To do this we’re going to import a Ruby library called open-uri and use its methods to fetch web data. Let’s first look at how Ruby reads a file from the web before we integrate that into our frequency program. I’ll be using an XML newspaper file from one of my digital history projects; feel free to use the same or select your own file:
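As a rough sketch, fetching a page with open-uri can be as short as the block below. The method name fetch is my own placeholder, and I’m assuming a reasonably recent Ruby, where the library is called through URI.open (older tutorials call a bare open with the URL, which current Ruby versions no longer accept).

```ruby
require 'open-uri'

# Fetch the body of a URL and return it as one string.
# `fetch` is a hypothetical helper name, not part of open-uri.
def fetch(url)
  URI.open(url) { |page| page.read }
end

# Uncomment to hit the network and print the raw file, tags and all:
# puts fetch('http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml')
```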
The above file will read the URL and print its contents to the screen. But you’ll notice something that will inconvenience us if we try to generate a frequency: the output includes the HTML tags. We need to get rid of all that junk. There are a couple of ways to do that, but we’re going to return to our good friend regex to look for HTML tags and strip out everything we don’t want. We’ll use the gsub method and regular expressions to replace HTML tags with empty strings. We’ll also use it to strip out punctuation marks and other HTML formatting (such as the &quot; entity). Make a small edit to your file:
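One way to sketch the tag stripping is shown here. The method name strip_markup and the sample string are assumptions, and the patterns are deliberately rough: a regex is fine for a quick frequency count but is not a real HTML parser. Note that this sketch substitutes a space rather than an empty string, so that words on either side of a tag don’t fuse together; that’s my choice, not a requirement.

```ruby
# Replace anything that looks like a tag (<...>) and any named HTML
# entity (&quot;, &amp;, ...) with a space.
def strip_markup(text)
  text.gsub(/<[^>]*>/, ' ').gsub(/&\w+;/, ' ')
end

# Sample markup (an assumption, standing in for the fetched page).
sample = '<p>He said &quot;hello&quot; to the <b>crowd</b>.</p>'

# squeeze collapses the runs of spaces the substitutions leave behind.
puts strip_markup(sample).squeeze(' ').strip
```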
You should now be seeing just the text of the webpage we are having Ruby read. Pretty cool, huh? But we’re not quite where we want to be yet. Let’s also get rid of punctuation and numbers as well as downcase all the text so we have a consistent word base:
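A sketch of that normalization step, assuming a helper I’m calling normalize: downcase first, then drop every character that is not a lowercase letter or whitespace. Be aware that this simple pattern also removes apostrophes (so “don’t” becomes “dont”) and any accented or non-English letters.

```ruby
# Lowercase the text, then delete digits and punctuation by removing
# everything that is not a-z or whitespace.
def normalize(text)
  text.downcase.gsub(/[^a-z\s]/, '')
end

# Sample text (an assumption for illustration).
puts normalize("Don't stop: Red Power, 1972!")
```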
Now let’s add this to our frequency generator.
Ok, run ruby frequency.rb and we should . . . wait, what happened? If you run this, you should get an error. Time to debug!
The issue is that we’re not reading a file; we’re reading the contents of a variable. So there’s no need for the call to File.new, and we can get rid of it. We also need to update the each method to read our URL variable:
All right, now we can run this. Type in ruby frequency.rb and . . . whoa. Something still isn’t right. The program does output some sort of frequency count, but it is counting lines rather than individual words. We forgot to split the words apart. So, we’ll add the split method:
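To see the difference split makes, here is a small side-by-side sketch. The method names count_lines and count_words and the sample text are my own, made up for illustration.

```ruby
# Counting by line: each whole line is treated as one "word".
def count_lines(text)
  counts = Hash.new(0)
  text.each_line { |line| counts[line.chomp] += 1 }
  counts
end

# Counting by word: split on whitespace first, then tally.
def count_words(text)
  counts = Hash.new(0)
  text.split.each { |word| counts[word] += 1 }
  counts
end

text = "red power\nred cloud\n"
p count_lines(text) # one entry per line, each counted once
p count_words(text) # "red" is now counted twice
```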
Before we move on, let’s clean things up a bit. Let’s move our URL reader into a method and rewrite some code. The method should look like this:
Now we can rewrite the URL input as:
Your file should now look similar to this:
We’re also going to add a new way of supplying input by using Ruby’s ARGV. ARGV is a global array that holds the command-line arguments passed after the filename. So, we’ll rewrite the code above a bit:
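Pulling the pieces so far together, the file might look roughly like this sketch. The names clean_words, read_source, and frequencies are hypothetical, and I’m again assuming a recent Ruby where open-uri is called through URI.open.

```ruby
require 'open-uri'

# Strip markup, normalize, and split raw text into a list of words.
def clean_words(raw)
  raw.gsub(/<[^>]*>/, ' ')
     .gsub(/&\w+;/, ' ')
     .downcase
     .gsub(/[^a-z\s]/, '')
     .split
end

# Fetch a source and return its cleaned words (hypothetical helper).
def read_source(source)
  clean_words(URI.open(source) { |io| io.read })
end

# Tally each word in the list.
def frequencies(words)
  counts = Hash.new(0)
  words.each { |word| counts[word] += 1 }
  counts
end

# ARGV[0] is the first argument typed after the script name.
if ARGV.empty?
  puts 'Usage: ruby frequency.rb URL'
else
  frequencies(read_source(ARGV[0]))
    .sort_by { |word, count| [-count, word] }
    .each { |word, count| puts "#{word}: #{count}" }
end
```

Checking ARGV.empty? first is a small courtesy: without it, running the script with no argument would fail with a less helpful error.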
You should now be able to run ruby frequency.rb http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml in the command line. And there we have it! A working word frequency generator that can read HTML files or local files. This may be as far as you want to go, but if you’re like me, you’d love to have a program that not only generates frequencies but will also output a file that you can use. In my case, when doing digital scholarship, I want files that can be read by a browser. So, we’re going to have the frequency list export as HTML. For this, we’ll be bringing back in our File I/O method:
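A sketch of the export step, under a few assumptions: the helper names to_html and write_html are mine, the table layout is just one simple choice, and the words are assumed to contain only letters (true after the cleanup above), so no HTML escaping is done.

```ruby
# Build a bare-bones HTML table from a hash of word => count,
# highest counts first.
def to_html(counts)
  rows = counts.sort_by { |word, count| [-count, word] }.map do |word, count|
    "<tr><td>#{word}</td><td>#{count}</td></tr>"
  end
  "<html>\n<body>\n<table>\n#{rows.join("\n")}\n</table>\n</body>\n</html>\n"
end

# Write the table to a file; the 'w' mode creates or overwrites it.
def write_html(counts, path)
  File.open(path, 'w') { |file| file.write(to_html(counts)) }
end

# Sample counts (made up for illustration), printed rather than saved.
puts to_html({ 'power' => 3, 'red' => 5 })
```

In the real program you would call something like write_html(counts, 'frequency.html'), where the output filename is whatever you prefer.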
Let’s also let the user know where the file was exported. Add to the end of the file:
So, your program should now look like:
You should now be set to type ruby frequency.rb http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml at the command line, which will compute the frequencies and output the results to an HTML file.
Neat, huh? Except . . . well, perhaps it isn’t that useful yet. I mean, is it really useful for us to know that the word “the” shows up 35 times? Not really. In fact, you’ve probably noticed that the majority of the highest frequencies in the list are common words (a pattern described by Zipf’s Law, under which a word’s frequency is roughly inversely proportional to its rank). Let’s get rid of those.
We’ll start by creating an array of common words. Let’s also make it a constant, so that Ruby warns us if we accidentally reassign it. Remember that we stripped out punctuation, so we need to list the words without apostrophes:
Now we’ll add this to our readFile method and tell Ruby to remove words that appear in the array:
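As a sketch of the filtering step: the short STOPWORDS list and the helper name remove_stopwords below are placeholders of my own, and a real list would be much longer. Notice “dont” and “its” appear without apostrophes, to match the cleaned text.

```ruby
# A deliberately tiny stopword list; freeze guards against accidental
# changes to the array itself.
STOPWORDS = %w[a an and dont in is it its of that the to].freeze

# Keep only the words that are not in the stopword list.
def remove_stopwords(words)
  words.reject { |word| STOPWORDS.include?(word) }
end

# Sample word list (an assumption for illustration).
p remove_stopwords(%w[the red power movement of the seventies])
```

For a long stopword list, converting it to a Set would make each lookup faster, but for a list this size an array is perfectly adequate.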
The program should now remove words that appear inside of the stopwords array. Now we have something a little more useful to us.
So, the program in its entirety should now look like:
Simply type in ruby frequency.rb http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml and the program will output an HTML file and confirm the successful completion of the program. Congrats! You now have your first full Ruby program. Do some hacking on this program. Add a function or feature to it or optimize the code and see what you can accomplish. Perhaps, for example, you want another method so you can output an HTML file that generates keywords in context or a word cloud. Or, if you’re really ambitious, maybe you can learn about Ruby on Rails and make this program run as a webpage rather than the command line.
If you’ve stuck through reading The Rubyist Historian to the end, you should now have a working knowledge of the Ruby programming language. I hope that I’ve been able to competently explain key concepts and ideas of Ruby. But we’ve only touched the surface of Ruby. There are several resources out there to continue learning about Ruby. I would start with these:
See something that’s wrong? Examples that don’t work? Explanations that are unclear or confusing? Embarrassing typographic errors? Drop me an email at jason.heppler+feedback at gmail and I’ll fix things right up!
Topic structure, examples, and explanations for the Rubyist Historian are inspired by, credited to, and drawn from Stephen Ramsay and his course Electronic Text.