Information and Data in the Digital Age

[This is a reading reflection written for HIST946: Digital Humanities with Professor William Thomas during the Fall 2011 semester. This week’s readings were John Seely Brown and Paul Duguid, *The Social Life of Information*; Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books”; and Roy Rosenzweig, “Scarcity or Abundance? Preserving the Past in a Digital Era”. You can find related posts here.]

We are experiencing an age of unprecedented access to information. The vast amount of material scanned and hosted by institutional repositories like JSTOR, ProQuest, and Google offers researchers a wide array of sources for their tasks. Yet this abundance of information perhaps leads to scarcity elsewhere — namely, in the methods and tools we have available to shed light on the information. As John Seely Brown and Paul Duguid note, however, information necessitates interaction with people. “For all of information’s independence and extent,” they write, “it is people, in their communities, organizations, and institutions, who ultimately decide what it all means and why it matters” (Brown and Duguid, 18). Information means little without context or interpretation; it serves purposes beyond merely existing. Infoenthusiasts point to the supremacy of technology in telling humans what the information means. There is no need to visit dusty archives or secure travel funding. Let the keyboard be your finding aid.

To an extent, there is truth to this. In my position as project manager on the William F. Cody Digital Archive, one of our explicit goals is to provide digital sources for researchers. It is not an easy task to get to the archives in Cody, WY, but by putting these sources online and verifying the authenticity of the material we are signaling to Cody researchers that they can place their trust in the digital reproductions and, if necessary, can easily track down the physical copy of an artifact. This is the sort of democratization that Roy Rosenzweig spoke of, noting that “once [digitization] is accomplished, [sources] can be made accessible to vastly greater numbers of people. To open up the archives and libraries in this way democratizes historical work” (Rosenzweig, 756). Users have direct access to cultural records made freely available to anyone with a connection to the Internet.

This is all highly desirable: in my case, we want Cody Studies to flourish, and the building of the Cody Archive is a step in that direction, though it is not without its own digital challenges. Preservation is an especially important consideration. Having the Archive backed by a digital center inside the library means there is a good chance that the material in the archive can be preserved digitally. Aiding in that task is our adherence to encoding and documentary editing standards. I can say the same for my own smaller-scale digital projects, both of which were built using standards and methods designed for longevity. However, a great deal of anxiety still exists surrounding digital artifacts. Historically, born-digital objects have been notoriously fragile, whether stored on magnetic tape, a compact disc, or a 3.5” floppy (Apple’s new MacBook Air, like the netbooks before it, does not contain a CD-ROM drive, perhaps signaling that the compact disc may not have much of a future left either). Digital media undergo such rapid transformation and innovation that preservation, sustainability, compatibility, portability, scalability, and flexibility can become elusive quickly (indeed, the only digital format humans have devised that maintains these qualities over time is plain text).

More broadly, however, there is the question of the nature of information and knowledge in the digital age. As Brown and Duguid point out, information only gains value through social use. The nature of knowledge and learning is necessarily collaborative, and communication is essential for knowledge production. Knowledge is not gained by isolating oneself in front of a keyboard; rather, learning is a social process in which knowledge and information are transmitted not only through physical objects (books, computers) but also by the incidental learning that occurs through observing or speaking with colleagues. There are limitations, in other words, on the information that we obtain. Technology shapes, constrains, and influences the information and knowledge we gather.

The abundance of information coupled with the power of computation means that humanistic inquiry into materials can be astounding. The coining of the “new” field of culturomics in the *Science* paper by Michel et al. points to the sort of analysis scholars can achieve with digital data, through their statistical treatment of five million books digitized by Google. Although their assertion that they are breaking new ground is annoying (since it ignores previous work), their approach to things like the appearance and disappearance of fame is interesting and raises the sort of issues humanists like to discuss.

However, the problem with Michel et al. is the problem of abundant information. They are the infoenthusiasts spoken of by Brown and Duguid. Too much faith is placed in the data and results themselves. Historical trends are not simply ebbs and flows on a graph; there is no close reading of historical events, something that is required when getting down to the granular elements of history. On the technical side, the analysis depends on the accuracy of Google’s OCR software (spend a few minutes with any out-of-copyright book in Google Books, download the plain text of the scan, and you will learn quickly that OCR is not known for its accuracy). Related to this is the simple fact that computers cannot read. The “dumb book” OCRed by Google has no embedded information and cannot distinguish the different senses of a word like “spring”, which may refer to the season or to the act of jumping. There is also a question of how accurately Google recorded metadata on the books in its corpus — not only items like authors and titles, but dates, publishers, the book’s format, and any number of other attributes important to historians, literary scholars, and linguists. More importantly, the data in Google Books is broad, not deep. It is obvious that “slavery” would be a common term during the Civil War, perhaps less so during the Civil Rights era one hundred years later. The ngrams built by Google would be more useful and interesting if the corpus extended beyond books alone — there is no searching in pamphlets, newspapers, magazines, or any number of other ways people communicated through documents. In short, Michel et al. place too much faith in information and technology, and do not fully consider the more granular ways that information can be used and interpreted for knowledge.
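The “computers cannot read” point can be made concrete with a minimal sketch (the sentences and variable names here are my own invention, not drawn from the *Science* paper): a bag-of-words count of the kind that underlies n-gram frequencies tallies the token “spring” identically whether it names the season or the act of jumping.

```python
from collections import Counter

# Two toy sentences using "spring" in different senses (hypothetical examples).
texts = [
    "In spring the flowers bloom across the plains.",
    "The cat will spring onto the table without warning.",
]

# Naive tokenization: lowercase, split on whitespace, strip trailing periods.
tokens = []
for text in texts:
    tokens.extend(word.strip(".").lower() for word in text.split())

counts = Counter(tokens)
# Both senses of "spring" collapse into a single tally; the count alone
# carries no information about which meaning occurred.
print(counts["spring"])  # 2
```

This is the granularity the frequency graph erases: the number 2 is accurate, but interpreting what it means still requires a human reader with the sentences in front of them.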

We should be careful not to quantify humanistic inquiry too much. Our goal as digital humanists is not to replicate the social sciences, nor to become the “science of the humanities”, nor to hinge our work on quantitative, data-rich sources alone. Certainly computation has a role in assisting humanistic inquiry — we cannot possibly approach the massive amounts of data or keep up with the copious studies without the aid of computers. Interpreting culture, however, will require not the number crunching of the machine but the human mind engaged in scholarly interpretation.

September 22, 2011 @jaheppler