Better Web Scraping with Nokogiri
When I wrote The Rubyist Historian a year ago, I was still getting familiar with the ins and outs of Ruby (and, truth be told, I still am – it will be a long time, if ever, before I call myself a programmer). Looking back on the word count program I wrote at the end, I’ve realized a big mistake: I used regular expressions to parse a webpage.
There’s a much more effective way to do it: the Ruby library Nokogiri. Nokogiri is built for HTML, XML, and SAX parsing and includes features that allow you to search for specific CSS3 or XPath selectors. Install the package through RubyGems (sudo gem install nokogiri) and you’ll be good to go.
Let’s say I wanted to do some text analysis on the books written by William F. Cody. On the Cody Archive, we currently have three of these books digitized, edited, and ready to go. They’re encoded with TEI standards and include metadata and information you might want for more sophisticated types of analysis. But to work with them, let’s say I’d like to have three clean copies of the text on my local machine without any markup included – just plain, clean, flat text files. Using Nokogiri, the process is pretty straightforward.
Fire up your text editor of choice, and write:
Now I can run it from the command line (don’t forget to chmod the script):
For your own purposes, you may have to make some edits to the code. In my case, I’m parsing an XML file that includes the tag text and, therefore, I tell Nokogiri to look for it with doc.search('text'). The XML file looks something like this:
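A simplified skeleton of such a TEI file (the actual Cody Archive files carry full namespace declarations and much richer metadata; this only shows where the teiHeader and text elements sit):

```xml
<TEI>
  <teiHeader>
    <!-- bibliographic metadata about the digitized book -->
  </teiHeader>
  <text>
    <body>
      <p>The plain text of the book itself ...</p>
    </body>
  </text>
</TEI>
```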
I don’t want any of the TEI header information, so with Nokogiri I can grab everything inside the text element. This leaves me with the raw text of the XML file without any of the markup or header information.
I’m becoming more and more convinced that at least half of the work I do in digital history is cleaning up and preparing data so it can be usable. I find this little script handy for doing a quick grab of a site’s contents, and because Nokogiri is incredibly powerful it can chug through web pages whose markup isn’t well-formed or valid. It’s faster than wget and, unlike wget, leaves me with plain text that I can start working with right away.
UPDATE: The program can be simplified slightly by changing the loop: