Parsing CSV Data with Ruby
Computers offer us a wonderful opportunity to automate and ease some of the tasks associated with data cleanup and preparation in the humanities. After a conversation with a researcher at CESTA on Friday, I thought about a problem he was confronting with a spreadsheet of data he is collecting that contains many thousands of rows. In particular, he is considering some methods for data mining and text analysis. The research deals in part with notes taken by a British consul about freed slaves in the nineteenth century, and, in this case, he wanted a method for extracting the consul’s notes based on criteria about the slaves he wrote about. It would be simple enough to do some basic work like this within a spreadsheet by sorting columns on a single criterion like gender or occupation, but combining criteria (e.g., where gender is “male” and occupation is “farmhand”) becomes a trickier task in a spreadsheet.
To work around this problem, I whipped up a couple of quick Ruby scripts this afternoon. The first script does most of the heavy lifting by iterating through a CSV file and finding the rows that match my criteria. Then, out of those matches, I locate the column that contains the data I want and tell the script to write it to STDOUT (or pipe the output to a text file).
You’ll have to play around with the script if you’re looking to use it for your own purposes. The script is very purpose-built and not at all robust or interoperable.
The script begins by defining the criteria to look for. Each criterion is a key-value pairing: the key corresponds to a CSV header (in this case, sex), and the script looks for the given string within that column (here, female). The script should be able to handle as many criteria as needed, but I have not tested this thoroughly.
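In Ruby, criteria like these can be expressed as a simple hash (a sketch; the second pair is a hypothetical addition just to show multiple criteria working together):

```ruby
# Each key is a CSV header; each value is the string to find in that
# column. Multiple pairs act as an AND across columns.
criteria = { "sex" => "female", "occupation" => "farmhand" }
```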
Then, the script opens the CSV file and iterates through the data, testing each row against the criteria and storing the results (either true or false). Finally, those stored matches are run against the column of data I want to collect (in this case, the column containing the consul’s notes). Wherever the match is true, the script grabs the data from that row and stores the result, which is then output to the screen.
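Putting the pieces together, a simplified sketch of the approach might look like the following. This is not the original script: the column name notes is a hypothetical stand-in for wherever the consul’s remarks live, and I wrap the logic in a helper method so the matching step is easy to reuse.

```ruby
require "csv"

# Criteria: CSV header => string to match in that column.
# Multiple pairs must all be satisfied (an AND across columns).
CRITERIA = { "sex" => "female" }.freeze

# Hypothetical name of the column whose contents we want to extract.
NOTES_HEADER = "notes"

# Collect the notes column from every row satisfying all criteria.
def matching_notes(csv_text)
  matches = []
  CSV.parse(csv_text, headers: true) do |row|
    matches << row if CRITERIA.all? { |header, value| row[header] == value }
  end
  matches.map { |row| row[NOTES_HEADER] }
end

# Reading from a file and printing to STDOUT is a small change:
#   puts matching_notes(File.read("data.csv"))
```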
I suspect the script will become more complicated in the near future. One problem I’m thinking about already has to do with the way the researcher I mentioned at the beginning recorded the occupations of freed slaves in his spreadsheet. For example, some occupations are simply listed as “cook,” while others contain a conjunction: “cook and farmhand.” The problem with my script is that it has no built-in fuzzy finding and looks only for exact strings (“cook” means “cook,” not “Cook” or “COOK”). As a quick fix for this, and as a way to check the data for accuracy, I put together a second script that simply counts the number of occurrences of each value in a given column.
Nothing complicated here. The script starts by creating an empty collection, then loads the data in. In this case, I want to check the data in a single column. The script iterates through each row in the CSV and stores a count for each unique value in that column. The results are displayed as value-count pairings, one per line.
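A sketch of such a counting script might look like this (I use a hash with a default count of zero as the tally; the column name “occupation” is just an example):

```ruby
require "csv"

# Tally how many times each unique value appears in one column.
# Hash.new(0) gives every unseen value a starting count of zero.
def column_counts(csv_text, header)
  counts = Hash.new(0)
  CSV.parse(csv_text, headers: true) do |row|
    counts[row[header]] += 1
  end
  counts
end

# Display each unique value alongside its count, one pair per line:
#   column_counts(File.read("data.csv"), "occupation").each do |value, count|
#     puts "#{value}: #{count}"
#   end
```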
The script exists more as an aid for me as I think through the previous script: I can find unique occurrences of things like gender, occupation, street addresses, and so on, which also locates misspellings or inconsistent data. Furthermore, until I add fuzzy finding to the previous script, the counting script allows me to see unique strings that I can add as search criteria to ensure I’m collecting all the data I want.