Data Curation as Publishing for the Digital Humanities
“Publishing” has assumed a large role in discussions of how scholarship is changing. One reason is that, in these discussions, the mechanisms of publishing come to stand in for the larger and more complex processes of creating, vetting, and circulating knowledge. Some of the sense of unmet need that arises in considerations of the emerging, alternative publishing methods for those working in digital humanities comes from the problems with this shorthand.
Meeting the publishing needs of digital humanities scholars is challenging not only because the outputs may take new forms—digital, “database-driven,” or somehow online—but also because some of the publishing that digital humanists want or need to do encompasses processes of knowledge creation, dissemination, and exchange that “publishing” does not encompass in its current forms. To put it another way, if we are focused on outputs, then we can probably accept “publishing” as a shorthand for the larger processes of scholarship: workflows, peer review, marketing, and production certainly need to be adjusted for new forms, but the framework of the conversation can remain the same. Yet, if we examine the work that humanists are doing—in something like the way that scholars in the field of Science and Technology Studies (STS) have done for science—by looking at their culture of material practices, then the familiar framework of “publishing” does not serve us well. The interplay of theory, data, and computational methods in a significant portion of digital humanities scholarship works in such a way that to publish this scholarship requires that we add some new dimensions to our ideas of “publishing.”
I want to suggest that the theory and practice of data curation can augment our notion of “publishing” in a way that will serve the needs of the digital humanities community. The work of data curation, defined as “active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; … activities [which] enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time” (Cragin et al. 2007), should be legible as “publishing” work for libraries and scholars to do in much the same way that well-understood tasks related to preparing and circulating monographs or journals are already legible as publishing work. Moreover, I argue that articulating connections between “publishing” and data curation is important in the context of strategic decisions that libraries make about how to participate in “publishing.” Data-curation-as-publishing is publishing work that draws directly on the unique skills of librarians and aligns directly with library missions and values in ways that other kinds of publishing endeavors may not.
The link between data curation and publishing is not wholly new. In 2009, Joyce Ray, Sayeed Choudhury, and Mike Furlough presented a paper at the Charleston Conference, summarizing several strands of contemporaneous work. The paper was entitled “Digital Curation and E-Publishing: Libraries Make the Connection.” In this paper, Ray, Choudhury, and Furlough describe how data curation and publishing can be mutually reinforcing activities. They write:
We have on the one hand, a community, or a subset of several communities, that has been working on the “back end” of digital production from the generation of raw data to the construction of an organized product that can be accessed, and, on the other hand, another community—publishers—who work on the “front end” of scholarly communications, from manuscripts to publication (Ray, Furlough, Choudhury 477).
Making the connection of the title involves bringing these communities together as complementary elements of a service portfolio, staffing model, and infrastructure that justifies the funding and the relevance of libraries in a changing scholarly environment. This is a good argument and some innovative libraries (among them Penn State, Johns Hopkins, New York University, and Purdue) seem to be having some success with this as a strategy. The main thrust of the argument that Ray, Choudhury, and Furlough advance is managerial, bringing together libraries and (university) publishers (under the aegis of the library) as an attempt to rationalize the “business” that the combined library/publisher is in. Treating data curation and publishing as kindred services may offer the prospect of expanding a library’s stable of “innovative” offerings while not straining resources because there are management efficiencies in having both the “front end” and “back end” people in the library. However, in this model, neither libraries nor publishing seems truly transformed and this is a problematic mismatch when so many other aspects of scholarly work are being transformed.
I argue that it is possible, even preferable, to treat the connection between data curation and publishing as more fundamental. As Ray, Choudhury, and Furlough themselves say: “Digital curation is a useful label for that collection of challenges newly located at the intersection of publishing, collections development, preservation, and the humanities” (Ray, Furlough, Choudhury 479).
Data Curation and Digital Humanities
As an enterprise, “digital humanities” (formerly “humanities computing”) dates back to the late 1940s (debatably, even earlier) and, since at least the 1980s, the curation of digital humanities research data has been an associated area of research, activity, and concern. Many of the early genres of digital humanities scholarship reinforced this connection between digital humanities scholarship and data curation: “development of indices, annotated linguistic corpora, and digitally encoded texts—in other words, the preparation, collection, organization, and maintenance of datasets” (Palmer, Weber, Muñoz, and Renear). For a fuller account see Palmer, Weber, Muñoz, and Renear, 2013. In a piece for the Chronicle of Higher Education in 2002, Jerome McGann predicted that “in the next 50 years, the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination” (B7). Many have taken McGann’s forecast as both a description of and a call to the work of digital humanities. This vision of digital humanities contains clear parallels with the goals of data curation practitioners and teachers aiming “to build and maintain not only digital libraries and curated data sets, but also the associated indexing systems, metadata standards, ontologies, and retrieval systems” (Palmer, Renear, and Cragin 2008 3). The point here is not that digital humanities and data curation have a special affinity—there are long and rich histories of data curation in the sciences and social sciences. My point is that data curation has been part of the ambit of digital humanities for a long time and this should guide us in thinking about how to publish digital humanities work.
In referring to “data curation,” I am speaking specifically of information work that integrates closely with the disciplinary practices and needs of researchers in order to “maintain digital information that is produced in the course of research in a manner that preserves its meaning and usefulness as a potential input for further research” (Muñoz and Renear 2011). This distinguishes data curation from many near synonyms: digital curation, digital stewardship, digital preservation. The emergence of a specific discourse on “data curation” in the sciences (with accompanying policy development, funding, and new research interest) provides a framework for pulling together diffuse and disparate activities in the humanities and describing these activities under the new rubric.
Thus, digital humanists in particular are becoming increasingly aware of data curation issues and data curation needs as part of the way they (we) work. This element of digital humanities work is becoming prevalent enough that I was able to select examples more or less at random from items that I came across in my professional networks and social media feeds over the period of a week in the spring of 2013. Lincoln Mullen, a PhD candidate at Brandeis University, posted on his blog about using the statistical programming language R for historical research. As part of his discussion, Mullen describes how he converted the tables from a monograph he found in his research to a series of comma-separated-values (CSV) files in order to produce graphs and charts of the changing demographics of American religion. Along with his analysis and the blog post about his methods, he posted the (small) data set to GitHub, a platform for sharing open source software code and open data. Ted Underwood, Associate Professor of English at the University of Illinois, has made the work he and a graduate assistant have done building, cleaning, normalizing, and labeling a data set drawn from the HathiTrust corpus a significant part of the output of his “Uses of Scale” project and other professional presentations. Kathleen Fitzpatrick has argued that humanists “might … find our values shifting away from a sole focus on the production of unique, original new arguments and texts to consider instead curation as a valid form of scholarly activity” (Fitzpatrick 79). Fitzpatrick uses “curation” here as a near synonym of selection after the manner of a gallery or museum curator selecting art for an exhibition, a slightly different meaning than I have been developing. The examples above of computational work with datasets draw in additional meanings of curation related to information science.
It is also increasingly common to see the release of open data sets as enticement to attract digital humanists to work on particular sets of questions, or in partnership with cultural heritage organizations—see, for example, the IndexCat data from the National Library of Medicine, a small collection of catalog records for a historical library of children’s literature, data from some of the crowdsourcing projects run by the New York Public Library, the Smithsonian Cooper-Hewitt, National Design Museum collection data, and many more examples.
At least part of the professional activity of the digital humanists and organizations above involves making data available and suitable for re-use. As any of the researchers involved would no doubt say, curation of these data sets takes time, effort, and money. Libraries getting involved to help digital humanists do this kind of work would be offering something of value. This would be “publishing” not only in the sense of registering and “making public” a product of scholarly work, but also in the sense of ensuring quality and disseminating outputs to interested communities. By recognizing data curation work as a publishing activity, libraries would have a “market opportunity” to address unmet needs in the digital humanities community (among others).
Distinguishing data-curation-as-publishing (a new and more-broadly conceived activity suited to the kinds of knowledge production and dissemination happening in digital humanities) from data as merely another form of publication is a crucial point. In a recent publication in Data Science Journal, Mark Parsons and Peter Fox explore “data publication” as a metaphor for the kind of things that scholarly communities want to see happen with data. They explain that “Data Publication builds from the familiar and conceptually simple model of scholarly literature publication” (WDS37) and they capitalize the terms deliberately to indicate the status of this phrase as “a recognized metaphor and data management paradigm” (WDS33). Parsons and Fox’s paper elaborates on some significant problems in adopting this metaphor. In the limited space available I want to focus on just one of these problems. Parsons and Fox note that under the model of Data Publication “publishers are distributed and can act autonomously or in concert” (WDS37). Thus, they write:
there is … little emphasis on data discovery and interoperability across systems. Data are often presented as they were created without explicit considerations of data integration or significant reuse. … The attention is on preservation and formal recognized scholarly contribution with less attention to … issues such as latency, rapid versioning and reprocessing, and computational demands (WDS37).
To understand data-curation-as-publishing (which I’m advocating as a way to serve digital humanities scholars) only as “Data Publication” expands recognizable publisher and library activities to a new class of scholarly objects (data) but in many ways perpetuates the (flawed) status quo.
Within the critique of “Data Publication” there are glimpses of what it could mean to treat the activities of data curation as “publishing” activities in a way that would benefit both scholars and libraries. The first part of Parsons and Fox’s critique is that under the model of “Data Publication” there is “little emphasis on data discovery and interoperability across systems” (WDS37). Various examples from the media landscape beyond scholarly publishing suggest the truth of this claim. In the realm of ebooks, the importance of outlets like Amazon and other digital dissemination channels has recently forced publishers to pay greater attention to “discovery” and to devote more resources to things like metadata. But at the same time, the fracturing and proliferation of ebook reading platforms is an ongoing example of problems of interoperability across systems in a publishing marketplace (there is a similar shape to the story of the relative fortunes of the on-demand video company Netflix and various real or rumored video platforms implemented by specific studios or content creators). This raises the question of whether a lack of emphasis on discovery and interoperability is intrinsic to the business of publishing (presumably because the energies of publishers are directed elsewhere to activities considered more vital to mission and survival). By contrast, attention to “discovery” and related issues of interoperability across systems is a traditional and persistent feature of library work.
This is where Ray, Choudhury, and Furlough see opportunity for libraries who can “make the connection” between publishing and data curation—to excel where traditional publishers have not—by having both “back end” (librarian) and “front end” (publisher) organizational capacity. In their discussion of “back end” and “front end,” these authors map the organizational rationalization of connecting data curation and publishing as library activities onto existing lifecycle models of data (this is explicit in the paper) and onto an extended lifecycle model for scholarly publications (in the example they give, journals and monographs) that encompasses both distribution and long-term preservation. “Putting a standard monograph series online didn’t make the Library a publisher,” they write, “but it linked the Library’s role as a preservation agent more directly to its emerging role as a distributor” (Ray, Choudhury, and Furlough 480).
Yet, for as much as these early sketches of programs (for which the authors deserve credit as pioneers in making any such moves in this direction) emphasize the interdependence and mutual reinforcement of curation and publishing, this vision of scholarly work with data is still somewhat disappointingly static and familiar. Publishers add value to end products through peer review and high quality production and presentation. Libraries standardize and preserve these outputs and continue to make them available to a community over time. Organizations which comprise both library and publisher can imagine this as a unified suite of services that covers the entire data lifecycle. However, if Data Publication, rather than data curation, is the governing metaphor, this alignment (just having both “back end” and “front end” of the process) may not be sufficient to avoid falling into traps such as neglecting discovery and interoperability of digital humanities work with data. It is worth noting, too, that a Data Publication model does not easily encompass “issues such as latency, rapid versioning and reprocessing, and computational demands” that resemble precisely the kinds of demands that digital humanists are likely to make (Parsons and Fox WDS37).
This leads to the next part of Parsons and Fox’s critique: under a model of Data Publication, “data are often presented as they were created without explicit considerations of data integration or significant reuse” (WDS37). Data “presented as they were created” sounds like a description of researcher self-deposit into (institutional) data repositories—currently the most common form of library engagement with data curation and embedded in both the Johns Hopkins and Penn State models. Libraries cannot adopt a position of becoming data publishers (via repository provision) in the way some are seeking to become journal publishers through the provision of platforms like Digital Commons and similar initiatives. That Parsons and Fox single this problem out in a discussion of why Data Publication is a problematic metaphor from the perspective of solving the real information needs of researchers suggests that while the provision of institutional data repositories is necessary and important it is not sufficient for “purposeful work” with data. So, libraries cannot stand pat; they cannot maintain only the “back end” of these processes but must make the connection to more active engagement.
Ray, Choudhury, and Furlough present a fully developed model of the data lifecycle (developed by the UK Digital Curation Centre) but offer only a loose schematic of a publishing lifecycle. To suggest how data-curation-as-publishing may help expand the notion of “publishing” in a way that would allow libraries to break out of their traditional role at the “back end,” I would add another model to the discussion. Historian Robert Darnton proposed a model of what he called “the communication circuit” in a seminal paper from 1982 entitled “What Is the History of Books?” Darnton’s model has been critiqued and elaborated by other book historians (including Darnton himself) since the first publication but the original version will suffice to advance the argument about data curation and publishing. What is salutary about Darnton’s model and what makes it an interesting partner for data lifecycle models is that it includes, in addition to authors and publishers, printers, shippers, booksellers and readers as part of the scope for understanding books as texts and objects.
Part of the unique contribution of book history as “an interdisciplinarity run riot” has been to reveal the agency of those who package, categorize, organize, disseminate, receive, re-use, and interpret in the co-creation of meaning and knowledge (Darnton 67). I am not convinced that Ray, Choudhury, and Furlough’s merged organization capable of both “front end” and “back end” work is sufficient to cover the stations of this richer model. I believe libraries should treat data curation activities as “publishing”—worthy of new enthusiasm and new resources—but they (we) should be wary of framing the endeavor as “data publishing” (an analog to journal and monograph publishing). By taking “publishing” as a category to be re-imagined rather than a pre-existing workflow to be stepped into, libraries can and should take a more active role in work with data. This is data curation-as-publishing.
Developing Capacity for Data Curation-as-Publishing in Libraries
What form might this actually take? To offer a specific example from digital humanities, data-curation-as-publishing might look something like the Alexandria Archive Institute’s Open Context project. Open Context provides review, documentation, and publication of research data, mostly in the discipline of archaeology. The “About” page of the project web site speaks of “data sharing as publication” and a flavor of the work the project carries out can be gleaned from a representative sample of the editors’ blog, which discusses matching data files with code books, cross-checking values, annotating, and describing the data set. The editors remark that “data sharing requires similar levels of effort and professionalism as other more conventional forms of publication.” Open Context is hosted and administered by the non-profit Alexandria Archive Institute and thus represents a kind of freestanding example of an organization doing data-curation-as-publishing. What would it mean to locate this effort in libraries?
This question recalls a point that Choudhury, Furlough, and Ray make in passing. In describing the creation of the Data Conservancy architecture and service at Johns Hopkins, they write: “It is especially important to note the role of a particular individual at AAS who acted as the human ‘interface’ between the various players. This individual could easily be classified as a ‘data scientist’ – an individual with knowledge of a specific domain or discipline yet also a deep knowledge of data management” (479). They go on to remark that “libraries would be wise to consider developing such expertise and capacity in-house” (479). I contend that Open Context, and its editors, represent another example of this kind and that libraries should be figuring out how to set up and host such activities. Developing the capacity to partner in this more broadly conceived version of publishing that digital humanities and other data-intensive disciplines increasingly need will require libraries to alter how they relate to collections.
At the University of Maryland Libraries, those working on data curation are beginning to work on the question of how to make a case to subject selectors (who control collection budgets to support various disciplines) to spend collection funds on curation work for significant data sets. These discussions are still at early stages (there is a lot to figure out, including what specifically should appear on “the invoices” for such data curation work that selectors are being asked to pay), but libraries who wish to engage seriously with support for data-intensive research (like the digital humanities) will increasingly need to sell and buy such services. These funds will need to come from collections (because that is still where the bulk of the budgets reside) and accomplishing this shift will entail breaking up many of the present economic “realities” that shape libraries’ collection development.
I am not envisioning allocating funds to buy datasets—perhaps from a vendor or platform that makes them available. The valuable work, the work that libraries should own, is the type of activity that the editors of Open Context perform. The current situation in which libraries purchase subscriptions to large databases of, for example, journal articles, represents not only an unsustainable economic situation but also an unsustainable professional one in which libraries outsource the expertise and experience of collecting, normalizing, organizing, and making available scholarly information. Librarians should spend more time on creating metadata, building catalogs, developing and refining indexes, and building, organizing, and maintaining collections than on negotiating publisher contracts or teaching the details of interfaces created by vendors. Extending library, archive, and information science practices for data may include aggregating data sets, cleaning and normalizing values, and annotating data with controlled vocabularies and ontologies. The issues of description, organization, and access for data are still largely unsolved and libraries should demonstrate their expertise in solving these challenges through developing and sustaining data curation-as-publishing programs.
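To give a rough sense of what cleaning, normalizing, and annotating can mean in practice, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the column names, the sample rows, the `VOCABULARY` mapping, and its `example.org` identifiers are hypothetical and are not drawn from Open Context or from any of the projects discussed above.

```python
import csv
import io

# Hypothetical raw data, as a researcher might deposit it: inconsistent
# capitalization, stray whitespace, and a missing value.
RAW_CSV = """site,material,count
 Tell A ,Pottery,12
tell a,pottery ,7
Tell B,BONE,
"""

# Hypothetical controlled vocabulary mapping normalized terms to identifiers.
VOCABULARY = {
    "pottery": "http://example.org/vocab/pottery",
    "bone": "http://example.org/vocab/bone",
}


def normalize(value):
    """Trim whitespace, collapse internal runs of spaces, and lowercase."""
    return " ".join(value.split()).lower()


def curate(raw_text):
    """Return cleaned rows, each annotated with a vocabulary identifier."""
    curated = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        material = normalize(row["material"])
        count = row["count"].strip()
        curated.append({
            "site": normalize(row["site"]),
            "material": material,
            # Keep missing counts as None rather than guessing a value.
            "count": int(count) if count else None,
            # Terms outside the vocabulary come back as None, for review.
            "material_uri": VOCABULARY.get(material),
        })
    return curated


rows = curate(RAW_CSV)
for row in rows:
    print(row)
```

Even in a toy case like this, the interesting decisions are editorial rather than mechanical: whether two spellings name the same site, what to do with a missing count, and which vocabulary a term should be mapped to. Those are the judgments that a data-curation-as-publishing program would document and stand behind.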
Data Curation-as-Publishing Aligns with Library Missions and Values
There is a clear need for data curation work. Perhaps this should be its own strategic initiative for libraries to pursue in parallel with “publishing” initiatives? (This is in fact what many libraries are doing.) Yet, with the financial support for libraries in flux, how many strategic initiatives can libraries count on and expect to do well? Data curation activities are fully legible as “publishing”—meeting the same ends and goals and potentially contributing to scholarship in the same kinds of ways. Also, “library publishing” is a site of buzz, activity, and potential investment, partly, as I have argued, because the processes and products of publishing stand in for “scholarship” writ large. I would argue that if libraries are going to invest resources in “publishing,” then that money should be spent partly on doing data curation work because data curation-as-publishing offers the most value to both researchers and libraries.
Data curation-as-publishing is the right form of publishing for libraries to be in because the work of data curation aligns with libraries’ missions and values in ways that other kinds of publishing ventures do not. There is much about scholarly “publishing” as it exists now that is not about making knowledge public or ensuring quality of that knowledge or disseminating it to those who need and could use it. There is a great deal of “publishing” that is about issues of prestige, labor, and equity of the disciplinary professions. In my opinion, libraries don’t really have a dog in that fight and shouldn’t spend resources trying to fix those problems.
In a recent paper in the library and information science literature on assessing data value, Carole Palmer, Nic Weber, and Melissa Cragin remind us that “the library and information science meta-science perspective articulated by [Marcia] Bates has always been fundamental to the role of providing broad, useable information collections and services, especially to support interdisciplinary research” (1999). Doing data curation work (like that described above) needs the unique training and skills of librarians and other information professionals and it supports the goals and values of the profession in making information accessible and usable to communities of users who need it. Making data curation fully legible as publishing, and investing in data curation-as-publishing, can help make problems of data discovery, interoperability, and re-use less daunting and show a clear way for the library to be a publisher in ways that research communities like digital humanities need.
Originally published by Trevor Muñoz on May 30, 2013 and revised for Journal of Digital Humanities November 2013.
Choudhury, Sayeed, Mike Furlough, and Joyce Ray. “Digital Curation and E-Publishing: Libraries Make the Connection.” Charleston Library Conference (2012): n. pag.
Cragin, Melissa H. et al. “An Educational Program on Data Curation.” Science and Technology Section of the Annual American Library Association Conference. Vol. 25. Washington, DC: N. p., 2007. Print.
Darnton, Robert. “What Is the History of Books?” (1982): n. pag. http://dash.harvard.edu/handle/1/3403038. Web. 2 Dec. 2013.
Daston, Lorraine. “Whither Critical Inquiry?” Critical Inquiry 30.2 (2004): 361–364. CrossRef. Web. 2 Dec. 2013.
Fisher, Barbara. “When Not Saying ‘No’ Is Negligence.” Inside Higher Ed. 14 Nov. 2013.
Fitzpatrick, Kathleen. Planned Obsolescence. New York: New York University Press, 2011. Print.
Galison, Peter. Image and Logic: a Material Culture of Microphysics. Chicago: University of Chicago Press, 1997. Print. http://www.worldcat.org/title/image-and-logic-a-material-culture-of-microphysics/oclc/36103882
Hockey, Susan. “The History of Humanities Computing.” Companion to Digital Humanities. Hardcover. Ed. Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell Publishing Professional, 2004. Blackwell Companions to Literature and Culture.
Latour, Bruno, and Steve Woolgar. Laboratory Life: The Construction of Scientific Facts. Princeton, N.J.: Princeton University Press, 1986. Print. http://www.worldcat.org/title/laboratory-life-the-construction-of-scientific-facts/oclc/13124747
Mullen, Lincoln A. “Using R to Chart the Historical Demography of American Judaism.” 16 May 2013.
Muñoz, Trevor, and Allen Renear. “Issues in Humanities Data Curation.” (2011): n. pag. ideals.illinois.edu. Web. 9 Feb. 2013.
Palmer, Carole L. et al. “Foundations of Data Curation: The Pedagogy and Practice of ‘Purposeful Work’ with Research Data.” Archive Journal 3 (2013): n. pag. Web. 2 Dec. 2013.
Palmer, Carole L., Allen H. Renear, and Melissa H. Cragin. “Purposeful Curation: Research and Education for a Future with Working Data.” (2008): n. pag. www.ideals.illinois.edu. Web. 9 Feb. 2013.
Palmer, Carole L., Nicholas M. Weber, and Melissa H. Cragin. “The Analytic Potential of Scientific Data: Understanding Re-use Value.” Proceedings of the American Society for Information Science and Technology 48.1 (2011): 1–10. Wiley Online Library. Web. 2 Dec. 2013.
Parsons, M. A., and P. A. Fox. “Is Data Publication the Right Metaphor?” Data Science Journal 12 (2013): WDS32–WDS46. Print.
- This piece was originally posted as an edited version of a presentation given at the CIC Center for Library Initiatives Annual Conference, May 22–23, 2013. The theme of the conference was “alt.pub.edu: Emerging Options for Scholarly Publishing” and I was delighted to be part of a panel with Matt Gold (CUNY) and Matthew Jockers (Nebraska) on “Digital Humanities, Alternative Publishing Needs of Faculty.” I have made additional revisions for publication in JDH. My thanks again to all the staff of the CIC Center for Library Initiatives and to the members of the Program Committee for the 2013 Annual Conference for inviting me to speak.
- Thanks are due to Shana Kimball for prompting this extension of the argument in discussion after my original talk.