Programming

Superfund Package

Mainly for my own research, but also as a resource for teaching, I pulled together a small data package for R a little while ago called SuperfundR. It contains Superfund site data for the United States, pulled from the Environmental Protection Agency, with some normalization to keep things tidy. I plan to keep it up to date for a while, and if I have a chance maybe I’ll write up a walkthrough on creating R data packages. The main table looks something like this.
library(tidyverse)
library(superfundr)
superfunds
#> # A tibble: 66,386 x 20
#> site_name epa_id city county state zipcode region npl_status
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> 
#> 1 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 2 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 3 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 4 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 5 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 6 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 7 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 8 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 9 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> 10 ATLAS TA… MAD00… FAIR… BRIST… MA 02719 1 Currently…
#> # … with 66,376 more rows, and 12 more variables:
#> # superfund_agreement <chr>, federal_facility <chr>, op_unit_no <dbl>,
#> # seq_id <dbl>, decision_type <chr>, completion_date <dttm>,
#> # fiscal_year <dbl>, media <chr>, contaminant <chr>, address <fct>,
#> # latitude <dbl>, longitude <dbl>
The data is structured just as it comes from the Environmental Protection Agency, with one row for each contaminant at each site. SuperfundR adds information from the EPA’s basic spreadsheet, including latitude and longitude coordinates and addresses, and converts data as necessary (title case for text, dates as date objects, etc.). This makes it easy to count the number of contaminants across sites.
superfunds %>%
  group_by(contaminant) %>%
  tally(sort = TRUE)
#> # A tibble: 663 x 2
#> contaminant n
#> <chr> <int>
#> 1 ARSENIC 2667
#> 2 LEAD 2531
#> 3 TRICHLOROETHENE 2049
#> 4 BENZENE 1659
#> 5 TETRACHLOROETHENE 1645
#> 6 CHROMIUM 1589
#> 7 CADMIUM 1538
#> 8 ZINC 1380
#> 9 MANGANESE 1288
#> 10 TOLUENE 1268
#> # … with 653 more rows
You can also do things like count the number of active or inactive sites.
superfunds %>%
  distinct(site_name, .keep_all = TRUE) %>%
  group_by(npl_status) %>%
  tally(sort = TRUE)
#> # A tibble: 7 x 2
#> npl_status n
#> <chr> <int>
#> 1 Currently on the Final NPL 1141
#> 2 Deleted from the Final NPL 362
#> 3 Not on the NPL 32
#> 4 Proposed for NPL 3
#> 5 Removed from Proposed NPL 2
#> 6 Site is Part of NPL Site 2
#> 7 <NA> 1
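One caveat with the site count: deduplicating on site_name would merge two different sites that happen to share a name. A minimal sketch of an alternative follows, assuming (this is my assumption, not something the package documents here) that epa_id uniquely identifies a site. The toy_sites table and all of its values are invented for illustration so the example runs without the package; count() is dplyr's shorthand for group_by() followed by tally().

```r
library(dplyr)

# Hypothetical toy data mimicking a few columns of `superfunds`.
# Site names, IDs, and statuses are invented for illustration only.
toy_sites <- tibble(
  site_name   = c("MILL CREEK", "MILL CREEK", "MILL CREEK", "OLD FOUNDRY"),
  epa_id      = c("XX0000001", "XX0000001", "XX0000002", "XX0000003"),
  npl_status  = c("Currently on the Final NPL", "Currently on the Final NPL",
                  "Deleted from the Final NPL", "Currently on the Final NPL"),
  contaminant = c("LEAD", "ARSENIC", "LEAD", "ZINC")
)

# Deduplicate on the identifier (assumed unique per site) rather than
# the name, then count how many sites fall under each status.
status_counts <- toy_sites %>%
  distinct(epa_id, .keep_all = TRUE) %>%
  count(npl_status, sort = TRUE)

status_counts
```

In this toy table, the two "MILL CREEK" entries with different IDs are kept as separate sites, whereas distinct(site_name, ...) would have collapsed them into one.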
This is an open source project, so I’d welcome any contributions folks would like to make.

BootcampR

This semester I am piloting a new six-week workshop series on the R programming language called BootcampR.

I’ve been teaching R workshops for a few years now and I’ve seen a few things that keep recurring in them. First, I seem to run out of time. Every time. The easy fix is to make the workshop a little longer. Of course, I want to be respectful of people’s schedules, so I didn’t add much time, but expanding from one hour to an hour-and-a-half might help make these workshops a bit more manageable.

One thing that does work well is the hands-on component of the workshops. When I co-taught my first R workshop at the Digital Humanities Summer Institute a few years ago with Lincoln, we developed RMarkdown worksheets as a way to interactively work through the language together with the class. I’ve continued to develop these worksheets, and I think overall they work pretty well for getting people hands-on with the language while also explaining why they’re doing certain things with it.

For a while now I’ve been teaching the tidyverse to R novices, but I’m trying something new this year: in the first two workshops, all of the work we’re doing is happening in Base R. In week three, we’ll start learning the tidyverse – and what I’m hoping to achieve here is a strong contrast between Base R Ways™ and Tidyverse Ways™ of doing the same task. My goal here is two-fold: to reiterate that there are multiple ways to do the same task in programming, and to show that there’s a cleaner and easier way of doing the task instead of using Base R. This is the highly opinionated section of the workshop: I remain convinced that the way to work with R is by using tidyverse methods.
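To give a sense of the kind of contrast I have in mind, here is a minimal sketch of one task done both ways: keep the towns above a population threshold, then total the population by state. The towns data frame and its values are invented for illustration and are not taken from the actual worksheets.

```r
library(dplyr)

# Small invented data frame, used only for illustration.
towns <- data.frame(
  town       = c("A", "B", "C", "D"),
  state      = c("NE", "NE", "IA", "IA"),
  population = c(1200, 300, 4500, 800)
)

# Base R: filter rows with logical indexing, then aggregate by state.
big_base <- towns[towns$population > 500, ]
base_result <- aggregate(population ~ state, data = big_base, FUN = sum)

# Tidyverse: the same steps written as a pipeline of named verbs.
tidy_result <- towns %>%
  filter(population > 500) %>%
  group_by(state) %>%
  summarise(population = sum(population))

base_result
tidy_result
```

Both versions arrive at the same totals per state; the difference is in how much of the indexing syntax a newcomer has to absorb before the code becomes readable.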

Finally, the last thing the workshop series is trying to do is build upon itself. In previous workshops, I’ve had to cram in a lot of information: an intro to R, to RStudio, and to the Tidyverse is a lot to fit into one hour. I’ve now broken out the intro to R material into its own workshop, which means by the time we get to class for the Tidyverse I can spend a lot more time explaining how the methods work and why you’d want to use them.

All of the content I’m creating for the course is released CC-BY, so please feel free to use anything I’m creating. Included for each workshop are readings (to be completed before the workshop, so we’re all coming at this with some prior knowledge), an interactive worksheet, resources for after the workshop to keep practicing or read more explanation, and slides from the lecture portion.

It’s the pilot version of this series and I’ll be assessing how it all went at the end. But so far, it seems to be going smoothly.

Teaching the tidyverse to R novices

This semester I am running my R workshops once again, and as always I start by teaching people the packages of the tidyverse. As part of Endangered Data Week, I am teaching two workshops introducing beginner R programmers to data tidying/manipulation and data visualization. I’ve taken this approach of using the tidyverse instead of base R for two primary reasons.

First, learning how to manipulate data with dplyr and tidyr is easy to understand conceptually and often easier than learning the idiosyncrasies of R. When I show students two lines of code that achieve the same thing in base R and dplyr, I’ve always gotten the same answer: the dplyr way is much easier to read and understand.

I’m not alone in my approach here; David Robinson has made the same case in regard to ggplot2. My rationale largely follows his: teaching students the basics of the tidyverse means they can be up and running with a powerful set of tools quickly. In the case of Endangered Data Week, that means introducing students to messy government data, tidying that data, working with data to produce new data, and drawing conclusions. I’m able to teach these concepts relatively quickly thanks to the power behind dplyr and tidyr. I don’t need to worry about teaching the syntax around [[]] or $ or c(). If students need base R techniques or have questions, they can always get in touch with me for more pointers.

For the data manipulation exercises in our workshop, we work off an RMarkdown worksheet together during the session. I provide students with some population data I compiled for a project I worked on last year and we work through most of the functions available in dplyr and tidyr. If we don’t get through it all, that’s fine; they have the worksheet to complete on their own time. (I make teaching these workshops a little easier for myself by also installing RStudio Server and the necessary packages on Digital Ocean so we can be up and running quickly.)

Second, students can be up and running with a good amount of knowledge about R, data manipulation, and visualization in a relatively short amount of time. After an hour-and-a-half together, even students who haven’t programmed previously are learning to work with the language. The grammar of data tidying allows these concepts to be grasped quickly since each step builds upon the previous one. Chaining together a series of tidyverse functions allows the students to see the steps necessary to reshape, clean, and explore a dataset. And those skills can be applied to any dataset, meaning students can take what they learn and apply it to other projects or classes.

Likewise, I prize tidyverse methods for their consistency. I’ve seen some wild ways people have accessed or manipulated columns in a data frame (just spend some time on Stack Overflow), but anytime I read someone’s tidyverse example the process clicks faster. That consistency, again, makes using the language, finding answers, and learning that much easier.

This isn’t to say I don’t teach any base R. Even in the above workshops, students still learn about sum(), logical operators, and other base methods alongside dplyr verbs like slice(). Pairing some of the base R methods with the tidyverse makes for a powerful set of tools that can have students manipulating and visualizing data quickly.

This approach of teaching the tidyverse with an interactive worksheet has worked well: students are up and running with R and applying new skills quickly. My goal is to help people work with data, and the tidyverse provides a powerful way to get started.
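As a rough illustration of the chained workflow described above, here is a minimal sketch that reshapes, cleans, and explores a tiny population table. The pop_wide table and its numbers are invented stand-ins, not the population data from the actual worksheet, and pivot_longer() assumes a reasonably recent version of tidyr (1.0.0 or later).

```r
library(dplyr)
library(tidyr)

# Invented wide-format table standing in for messy source data:
# one column per census year, with a missing value for Town C.
pop_wide <- tibble(
  city   = c("Town A", "Town B", "Town C"),
  `1900` = c(100, 200, NA),
  `1910` = c(150, 260, 50)
)

pop_tidy <- pop_wide %>%
  # Reshape: one row per city-year instead of one column per year.
  pivot_longer(cols = -city, names_to = "year", values_to = "population") %>%
  # Clean: drop missing counts and store the year as a number.
  filter(!is.na(population)) %>%
  mutate(year = as.integer(year)) %>%
  # Explore: change between each city's first and last recorded year.
  arrange(city, year) %>%
  group_by(city) %>%
  summarise(growth = last(population) - first(population))

pop_tidy
```

Each line of the pipeline corresponds to one step a student can read aloud, which is the point of teaching it this way: the code doubles as a description of the process.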