The Journal of Digital Humanities

Published by the Roy Rosenzweig Center for History and New Media

Vol. 1 No. 1 Winter 2011

ISSN 2165-6673

CC BY 3.0

Introductions

A Community-Sourced Journal, from the Editors

We’re pleased to present the inaugural issue of the Journal of Digital Humanities, which represents the best of the work that was posted online by the community of digital humanities scholars and practitioners in the final three months of 2011.

We wish to underline this notion of community. Indeed, this new journal is predicated on the idea that high-quality, peer-reviewed academic work can be sourced from, and vetted by, a mostly decentralized community of scholars rather than a centralized group of publishers. Nothing herein has been submitted to the Journal of Digital Humanities. Instead, as is now common in this emerging discipline, works were posted on the open web. They were then discovered, and judged meritorious, by the community and by our team of editors.

The works in this issue were first highlighted on the Digital Humanities Now site and its related feeds. Besides taking the daily pulse of the digital humanities community—important news and views that people are discussing—Digital Humanities Now serves, as newspapers do for history, as a rough draft of the Journal of Digital Humanities. Meritorious new works were linked to from Digital Humanities Now, thus receiving the attention and constructive criticism of the large and growing digital humanities audience—approaching a remarkable 4,000 subscribers as we write this. Through a variety of systems we continue to refine, we have been able to spot articles, blog posts, presentations, new sites and software, and other works that deserve a broader audience and commensurate credit.

Once highlighted as an “Editors’ Choice” on Digital Humanities Now, works were eligible for inclusion in the Journal of Digital Humanities. By looking at a range of qualitative and quantitative measures of quality, from the kinds of responses a work engendered, to the breadth of the community who felt it was worth their time to examine a work, to close reading and analyses of merit by the editorial board and others, we were able to produce the final list of works. For the inaugural issue, more than 15,000 items published or shared by the digital humanities community last quarter were reviewed for Digital Humanities Now. Of these, 85 were selected as Editors’ Choices, and from these 85 the ones that most influenced the community, as measured by interest, transmission, and response, have been selected for formal publication in the Journal. The digital humanities community participated further in the review process through open peer review of the pieces selected for the Journal.

To be sure, much worthy content had to be left out. But unlike a closed-review journal, ours makes it easy to see what we had to choose from, since the trail of Editors’ Choices remains on Digital Humanities Now. Inclusion in this issue is in many respects harder and rarer than inclusion in a print or print-like journal, since it represents a tiny minority (less than one percent) of the work that digital humanities scholars made public in this period. We hope and expect that this selectivity will reinforce the value of the work included.

Even with these several layers of winnowing, the result is a sizable and wide-ranging first issue, roughly 150 pages and four hours of multimedia. The most-engaged article of the quarter was by Natalia Cecire, whose post on theory in digital humanities sparked an energetic debate and many additional posts by those who agreed or disagreed. In response, we asked Natalia to be a guest editor of a special section in this issue on the topic of her piece, which she has introduced and knitted together with responses addressing digital humanities’ awkward relationship to theory (or the lack thereof).

Beyond this special section, we have a slate of individual articles, including lengthy treatments of text mining and visualization, critical discourse and academic writing, the use and analysis of visual evidence, and a series of podcasts on humanities in a digital age. To start the issue, we have included a piece by Lisa Spiro on how to get started in digital humanities, and in what we believe is a first for the field, we end the issue with an entire section devoted to a critical engagement with tools and software.

We believe the variety of content in the Journal of Digital Humanities truly parallels the scope of work being done in the community. Because this journal is digital-first, we are able to take into account the full array of works produced in the discipline. Unlike other publications, we can, for instance, point to and review software, and we can include audio and video. We can also accept works of any length. We plan to maintain this flexibility: there is no real or implied pressure to submit a standard essay of 5,000-10,000 words or to flatten nonlinear digital works into a print-oriented linear narrative.

Our community- and web-sourced method has several other advantages over the traditional journal model. First, as we have already noted, many more eyes have looked at the content within this volume, ranging from perhaps superficial readers—hundreds who saw and read it in their RSS readers or via social networks—to more in-depth engagements, such as those who responded in comments on the site of the original work, wrote a response on their own site, or who participated in our open review of the selected works on the Digital Humanities Now website.

Moreover, we believe this model has helpfully led to the inclusion of contributors from a wider range of stations than a traditional academic journal. Represented in this volume are up-and-coming graduate students already doing innovative and important work; non-academics and technologists who focus on thorny and often intellectual questions of implementation and use; and librarians, archivists, museum professionals, and others in fields that border on or intersect with digital humanities. We believe this is healthy for the ideas and practice of the digital humanities community, moving it beyond an insular community of mostly tenure-track academic scholars.

In that spirit of inclusion, we hope that you’ll join us in contributing to the Journal of Digital Humanities, as someone who finds and validates new work—as a daily editor on Digital Humanities Now or as a quarterly editor on the journal—or, like those whose work appears in this first issue, as someone who contributes greatly to the field by openly posting their work online.

Daniel J. Cohen and Joan Fragaszy Troyano, Editors

Getting Started in Digital Humanities, by Lisa Spiro

When I presented at the Great Lakes College Association’s New Directions workshop on digital humanities (DH) in October, I tried to answer the question “Why the digital humanities?” But I discovered that an equally important question is “How do you do the digital humanities?”  Although participants seemed to be excited about the potential of digital humanities, some weren’t sure how to get started and where to go for support and training. Building on the slides I presented at the workshop, I’d like to offer some ideas for how a newcomer might get acquainted with the community and dive into digital humanities work. I should emphasize that many in the digital humanities community are to some extent self-taught and/or gained their knowledge through work on projects rather than through formal training. In my view, what’s most important is being open-minded, experimental, and playful, as well as grounding your learning in a specific project and finding insightful people with whom you can discuss your work.

Determine what goals or questions motivate you

As with any project, a research question, intellectual passion, or pedagogical goal should drive your work. Digital humanities is not technology for the sake of technology. It can encompass a wide range of work, such as building digital collections, constructing geo-temporal visualizations, analyzing large collections of data, creating 3D models, re-imagining scholarly communication, facilitating participatory scholarship, developing theoretical approaches to the artifacts of digital culture, practicing innovative digital pedagogy, and more.

Get acquainted with digital humanities

Participate in the digital humanities community

Frankly, I think that the energy, creativity, and collegiality of the digital humanities community offer powerful reasons to become a digital humanist.

Stay informed

  • I always learn something from ProfHacker, a fantastic group blog focused on teaching, tools, and productivity. (By the way, ProfHacker was hatched at a THATCamp.)
  • GradHacker covers software reviews, discussions of professional issues, and more, from a grad student perspective. (Hat tip Ethan Watrall.)
  • Subscribe to the Humanist Discussion Group, which is expertly facilitated by Willard McCarty and has supported conversation and information sharing since 1987.
  • Check out Digital Humanities Now, which brings together current discussions and news in the digital humanities community “through a process of aggregation, discovery, curation, and review.”
  • Follow what people are bookmarking on Diigo or Delicious. (I’m a compulsive bookmarker, but not so good about annotating what I come across.)
  • Join the Digital Humanities Zotero group, which collects resources on the digital humanities. (Hat tip Mark Sample.)
  • Explore what’s going on at digital humanities centers. Check out CenterNet, an “international network of digital humanities centers.”
  • Connect with local digital humanities centers. For example, in the Great Lakes region, Michigan State University’s MATRIX builds digital collections, hosts H-Net, offers training, and more. (Hat tip Ethan Watrall).

Explore examples for inspiration and models

To find projects, see, for example,

Pursue training

Workshops and Institutes

Online tutorials

Learn standards and best practices

If you want your project to have credibility and to endure, it’s best to adhere to standards and best practices. By talking to experts, you can develop a quick sense of the standards relevant to your project. You may also wish to consult:

Find collaborators

Most digital humanities projects depend–and thrive–on collaboration, since they typically require a diversity of skills, benefit from a variety of perspectives, and involve a lot of work.

  • Digital Humanities Commons serves as an online hub (or matchmaking service) where people can identify projects to collaborate with and projects can discover collaborators. (I’m a member of the advisory board.)
  • Talk with library and IT staff at your own institution. Although many library and IT professionals are necessarily focused on the day-to-day, there is also an increasing recognition that what will distinguish libraries and IT groups is their ability to collaborate with scholars and teachers in support of the academic mission. Be a true collaborator–don’t just expect technical (or content) experts to do your bidding, but engage in conversation, shape a common vision, and learn from each other. (Steve Ramsay offers great advice to collaborators in “Care of the Soul,” and the Off the Tracks Workshop devised a useful “Collaborators’ Bill of Rights.”) If you can bring seed funding or administrative backing to a project, that might make it easier to attract collaborators or garner technical support.
  • Reach out to others in your community. By attending a THATCamp or corresponding with someone who shares your interests, you may discover people who can contribute to your project or help shape a common vision. You could also find a colleague in computer science, statistics, or another field who has common research interests and would be eager to collaborate. You might be able to hire (or barter with) consultants to help out with technical tasks or provide project advice; I understand that Texas A&M’s Initiative for Digital Humanities, Media, and Culture is exploring offering consulting services in the future to help advance the digital humanities community.
  • Engage students. While there can be risks (after all, students graduate), students can bring energy and skills to your project. Moreover, working on digital humanities projects can give them vital technical, project management, and collaborative skills.
  • Consider a DIY approach. As Mark Tebeau of Cleveland Historical wisely observed at the New Directions workshop, if your institution doesn’t provide the support you need for your DH project, why not strike out on your own? As Trevor Owens suggests in “The digital humanities as the DIY humanities,” it takes a certain scrappiness to get things done in digital humanities, whether that’s learning how to code or figuring out how to set up a server. If you don’t think you have the time or skills to, say, run your own web server, consider a hosted solution such as Omeka. In the long term, it’s a good idea to affiliate with an institution that can help to develop and sustain your project, but you may be able to get moving more quickly and demonstrate the value of your idea by starting out on your own.

Plan a pilot project

Rather than getting overwhelmed by trying to do everything at once, take a modular approach. At the New Directions workshop Katie Holt explained how she is building her Bahian History Project in parts, beginning with a database of the 1835 census for Santiago do Iguape parish in Brazil and moving into visualizations, maps, and more. This approach is consistent with the “permanent beta” status of many Internet projects. Showing how a project moves from research question to landscape review to prototype to integration into pedagogy, Janet Simons and Angel Nieves of Hamilton’s Digital Humanities Initiative demonstrated a handy workflow and support model for digital projects at the workshop.

Where possible, adopt/adapt existing tools

Explore open source software. Too often projects re-invent the wheel rather than adopting or adapting existing tools.

  • Find tools via the Digital Research Tools (DiRT) wiki, which I founded. (Bamboo DiRT now has a new home and provides better browsing and sharing capabilities, thanks to the hard work of the fabulous Quinn Dombrowski and Bamboo.)
  • SHANTI’s UVa Knowledge Base offers useful information about technologies, teaching, and research approaches. (Aimed at the University of Virginia, but more widely applicable.)
  • You can also poke around GitHub, which hosts code, to identify tools under development by members of the digital humanities community such as CHNM and MITH.

NITLE Can Help

Let me end with a plug for NITLE (the National Institute for Technology in Liberal Education), my (relatively) new employer. One of the reasons I wanted to join the organization as the director of NITLE Labs is that I was impressed by its digital humanities initiative, which my colleague Rebecca Frost Davis leads. Among NITLE’s activities in the digital humanities:

If you’re a veteran digital humanist, how did you get started, and what do you wish you knew from the beginning? If you’re a newcomer, what do you want to know? What worries you, and what excites you? What did I leave out of this overview? I welcome comments on my blog.

Originally published by Lisa Spiro on October 14, 2011. Revised March 2012.

Articles

Academic History Writing and its Disconnects, by Tim Hitchcock

The last ten years have seen the development of what looks like a coherent format for the publication of inherited texts online – in particular, ‘books’. The project of putting billions of words of keyword-searchable text online is now nearing completion (at least in a Western context); and the hard intellectual work that went into this project is now done. We are within sight of that moment when all printed text produced between 1455 and 1923 (when US copyright provisions mean that the needs of modern corporations and IP owners outweigh those of simple scholarship) will be available online to search and to read. The vast majority of this digital text is currently configured to pretend to be made up of ‘books’ and other print artifacts. But, of course, it is not books. At some level it is just text – the difference between one book and the next is a single line of metadata. The hard leather covers that traditionally divided one group of words from another are gone; and while scholars continue to pretend to be reading books, even when seated comfortably in front of their office computer, this is a charade. Modern humanities scholarship is a direct engagement with a deracinated, Google-ised, Wikipedia-ised, electronic text.

For the historian, this development has two significant repercussions. First, the evolution of new forms of delivery and analysis of inherited text problematizes and historicizes the notion of the book as an object, and as a technology. And second, in the process of problematizing the ‘book’, it also impacts the discipline of history as it is practiced in the digital present. Because history has been organised to be written from ‘books’, found in hard copy libraries, the transformation of books to texts forces us to question the methodologies of modern history writing.

In other words, the book as a technology for packaging, delivering, storing, and finding text is now redundant. The underpinning mechanics that determined its shape and form are as antiquated as moveable type. And in the process of moving beyond the book, we have also abandoned the whole post-enlightenment infrastructure of libraries and card catalogues (or even OPACs), of concordances, and indexes, and tables of contents. They are all built around the book, and the book is dead.

To many this will appear mere overstatement; just another apocalyptic pronouncement of radical change of the sort digital humanists specialize in. And there is no question but that ‘books’ will continue to be published for the foreseeable future. Just as manuscripts continued to be written through all the centuries of the book, so the hard copy volume will survive the development of the online and the digital. But, the transition is nevertheless important and transformational; and for a start allows us to interrogate the ‘history of the book’ in new ways.

First, it allows us to begin to escape the intellectual shackles that the book as a form of delivery imposed upon us. That chapters still tend to come out at a length just suited to a quire of paper is a commonplace instance of a wider phenomenon. If we can escape the self-delusion that we are reading ‘books’, the development of the infinite archive and the creation of a new technology of distribution allow us to move beyond the linear and episodic structures the book demands, to something different and more complex. It also allows us to view the book more effectively as an historical artifact and now redundant form of controlling technology. The ‘book’ is newly available for analysis.

The absence of books makes their study more important, more innovative, and more interesting. It also makes their study much more relevant to the present – a present in which we are confronted by a new, but equally controlling and limiting technology for transmitting ideas. By mentally escaping the ‘book’ as a normal form and format, scholars can see it more clearly for what it was. To this extent, the death of the book is a liberating thing – the fascist authority of the format is beaten.

At the same time we are confronted by a profound intellectual challenge that addresses the very nature of the historical discipline. This transition from the ‘book’ to something new fundamentally undercuts what historians do more generally. When one starts to unpick the nature of the historical discipline, one finds it tied up with the technologies of the printed page and the book in ways that are powerful and determining. Footnotes, post-Rankean cross referencing, and the practices of textual analysis are embedded within the technology of the book, and its library.

Equally, the technology of authority – all the visual and textual clues that separate a Cambridge University Press monograph from the irresponsible musings of a know-nothing prose merchant – is slipping away. At the same time, the currency of professional identity – the titles, positions, and honorifics, built again on the supposedly secure foundations of book publishing – seems ever more debased. The question becomes: is history, like the book – particularly in its post-Rankean, professional, and academic form – dead? Are we losing the distinctive disciplinary character that allows us to think beyond the surface and makes possible complex analyses that transcend mere cleverness and aspire to explanation?

On the face of it, the answer is yes – the renewed role of the popular blockbuster, and an ever growing and insecure emphasis on readership over scholarship, would suggest as much. In Britain, humanist scholars shy away from the metrics that would demonstrate the ‘impact’ of their work primarily from fear that it may not have any. A single, self-evident symptom of a deeper malaise is the failure to cite what we read. We read online journal articles, but cite the hard copy edition; we do keyword searches, while pretending to undertake immersive reading. We search ‘Google Books’, and pretend we are not.

But even more importantly, we ignore the critical impact of digitisation on our intellectual praxis. Only 48% of the significant words in the Burney collection of eighteenth-century newspapers are correctly transcribed as a result of poor OCR.[1] This makes the other 52% completely un-findable. And of course, from the perspective of the relationship between scholarship and sources, it is always the same 52%. Bill Turkel describes this as the Las Vegas effect – all bright lights, and an invitation to instant scholarly riches, but with no indication of the odds, and no exit signs. We use the Burney collection regardless – entirely failing to apply the kind of critical approach that historians have built their professional authority upon. This is roulette dressed up as scholarship.

In other words, historians and other humanists have abandoned the rigour of traditional scholarship. Provenance, edition, transcription, editorial practise, readership, authorship, reception – the things academics have traditionally queried in relation to books, are left unexplored in relation to the online text which now forms the basis of most published history.

As importantly, the way ‘history’ is promulgated has not kept up either. Why have historians failed to create television programmes with footnotes, and graphs with underlying spreadsheets and sliders? History is part of a grand conversation between the present and the past, played out in extended narrative and analysis, with structure, point, and purpose; but it will be increasingly impoverished if it continues to be produced as a ragged and impotent ghost of a fifteenth-century technology. The book had a wonderful 1,200-odd-year history, which is certainly worth exploring. Its form self-evidently controlled and informed significant aspects of cultural and intellectual change in the West (and, through the impositions of Empire, the rest of the world as well); but if historians are to avoid going the way of the book, they need to separate out what they think history is designed to achieve, and to create a scholarly technology that delivers it.

In a rather intemperate attack on the work of Jane Jacobs, published in 1962, Lewis Mumford observed that:

… minds unduly fascinated by computers carefully confine themselves to asking only the kind of question that computers can answer and are completely negligent of the human contents or the human results.[2]

In the last couple of decades, historians who are unduly fascinated by books have restricted themselves to asking only the kind of questions books can answer. Fifty years is a long time in computer science. It is time to find out if a critical and self-consciously scholarly engagement with computers might not now allow the ‘human contents’ of the past to be more effectively addressed.

A post-endum

This piece was adapted from the rough text of a short talk delivered to a symposium on ‘Future Directions in Book History’ held at Cambridge University on the 24th of November 2011.  It then had an extended afterlife both as a post on my own blog, Historyonics, and in the Open Peer Review section of Digitalhumanitiesnow.org in preparation for the Journal of Digital Humanities. I then revised it for re-publication in a post-peer review format. The comments were useful, and I am particularly grateful to John Levin, Adam Crymble, Alycia Sellie, Joe Grobelny, and Lisa Spiro for their willingness to engage critically with it. I have tried to incorporate some of their views within the text. But, I also wanted to take this opportunity to record my own feelings about the process.

The text was originally written in my normal ‘ranting’ voice, with all the freedom that implies to overstate and shock. The tone is perhaps slightly adolescent, but it is a style that works in the intimate atmosphere of an academic venue, and embeds all the pastiche rhythms and rhetorical tics I have collected over thirty years of academic writing and lecturing. Its subsequent publication as a blog post was flagged as a text intended for personal, verbal presentation. First person pronouns were retained and the imagined gestures and pauses left to do their work. But in revising it for this post-peer review re-publication I found myself automatically changing it into a different form, speaking in a different voice – more distant, more careful, more ‘academic’ for lack of a better word. I have also toned down some (though not all) of the overstatement and hyperbole.

This revision has been an enjoyable process, and I have particularly benefited from the direct engagement with the comments posted, but I am left with yet another conundrum. I like overstatement and hyperbole. I find them intellectually useful, and the form of an un-reviewed blog and ranty presentation gave me real freedom to indulge in them. The original text reflected all the joys of composing in high voice; and all the freedoms of being an unconstrained publisher of one’s own thoughts. In other words, as an author, I was gifted the joy of a blogger, and found that responding to peer review (open or otherwise) merely tarnished and dulled my own pleasure in the product.

Of course, prose is intended for an audience, and preferably an audience that extends beyond the author alone. But this experience makes me wonder if we need to rethink peer review even more fundamentally than the move from closed to open formulae implies. Perhaps we need to recognise that reconstructing a process of selection and revision (of re-creating the scholarly journal online with knobs on) achieves only half the objective. Perhaps we also need to recognise the value of the draft, and the talk; the prose written for an audience of one, and shared only because it can be. Perhaps we need to worry less about the forms and process of generating authority and get on with the work of engaging with a wider world of ideas.

As you will have guessed, I have suddenly moved into blog mode – and it is simply more fun than academic writing.

Originally published by Tim Hitchcock on October 23, 2011. Revised March 2012.

  [1] Simon Tanner, Trevor Muñoz, and Pich Hemy Ros, “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive,” D-Lib Magazine 15, no. 7/8 (2009), http://www.dlib.org/dlib/july09/munoz/07munoz.html.
  [2] Lewis Mumford, “The Sky Line: ‘Mother Jacobs Home Remedies’,” The New Yorker (December 1, 1962), 148.

Defining Data for Humanists: Text, Artifact, Information or Evidence? by Trevor Owens

Data seems to be the word of the moment for scholarship. The National Endowment for the Humanities and a range of other funders are inviting scholars to “dig data” in their “Digging into Data” grant program. Data itself is now often discussed as representing a fourth paradigm for scientific discovery and scholarship. What is a humanist to do in such a situation? Does data, in particular big data, require humanists to adopt a new methodological paradigm? Or are the kinds of questions humanities scholars traditionally have explored through close reading and hermeneutic interpretation relevant to big data? In this brief essay I suggest that some of the ways humanists already think about and analyze their sources can be employed to understand, explore, and question data.

What is Data to a Humanist?

We can choose to treat data as different kinds of things. First, as constructed things, data are a species of artifact. Second, as authored objects created for particular audiences, data can be interpreted as texts. Third, as computer-processable information, data can be computed in a whole host of ways to generate novel artifacts and texts which are then open to subsequent interpretation and analysis. Which brings us to evidence. Each of these approaches—data as text, artifact, and processable information—allows one to produce or uncover evidence that can support particular claims and arguments. Data is not in and of itself a kind of evidence but a multifaceted object which can be mobilized as evidence in support of an argument.

Data as Constructed Artifacts

Data is always manufactured. It is created. More specifically, data sets are always, at least indirectly, created by people. In this sense, the idea of “raw data” is a bit misleading. The production of a data set requires choices about what and how to collect and how to encode the information. Each of those decisions offers a new potential point of analysis.

Now, when data is transformed into evidence, when we isolate or distill the features of a data set, or when we generate a visualization or present the results of a statistical procedure, we are not presenting the artifact. These are abstractions. The data itself has an artifactual quality to it. What one researcher considers noise, or something to be discounted in a dataset, may provide essential evidence for another.

In the sciences, there are some tacit and explicit agreements on acceptable assumptions and a set of statistical tests exist to help ensure the validity of interpretations. These kinds of statistical instruments are also great tools for humanists to use. They are not, however, the only way to look at data. For example, the most common use of statistics is to study a small sample in order to make generalizations about a larger population. But statistical tests intended to identify whether trends in small samples scale into larger populations are not useful if you want to explore the gritty details and peculiarities of a data set.

Data as Interpretable Texts

As a species of human-made artifact, data sets can be thought of as having the same characteristics as texts. Data is created for an audience. Humanists can, and should, interpret data as an authored work, and the intentions of the author are worth consideration and exploration. At the same time, the audience of data is also relevant. Employing a reader-response theory approach to data would require attention to how a given set of data is actually used, understood, and interpreted by various audiences. That could well include audiences of other scientists, the general public, government officials, etc. When we consider what a data set means to individuals within a certain context, we open up a range of fruitful interpretive questions which the humanities are particularly well situated to explicate.

Data as Processable Information

Data can be processed by computers. We can visualize it. We can manipulate it. We can pivot and change our perspective on it. Doing so can help us see things differently. You can process data in a stats package like R and run a range of statistical tests to uncover statistically significant differences or surface patterns and relationships. Alternatively, you can deform a data set with a process like Spoonbill’s N+7 machine, which replaces every noun in a text with the seventh word in the dictionary that follows the original, thus prompting you to see the original data from a different perspective, as Mark Sample’s Hacking the Accident did for Hacking the Academy. In both cases, you can process information—numerical or textual—to change your frame of understanding for a particular set of data.
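To make the mechanics of such a deformation concrete, here is a minimal sketch in Python. It is a simplification of the N+7 procedure rather than Spoonbill’s actual tool: it shifts every word it recognizes (identifying only the nouns would require a part-of-speech tagger), and the tiny word list stands in for a real dictionary.

```python
import bisect

# A stand-in "dictionary": in practice this would be a full alphabetized word list.
LEXICON = sorted(["academy", "accident", "archive", "argument", "artifact",
                  "audience", "author", "book", "corpus", "culture",
                  "data", "evidence", "humanist", "text"])

def n_plus_7(text, shift=7):
    """Replace each word found in LEXICON with the word `shift` entries later."""
    out = []
    for word in text.lower().split():
        i = bisect.bisect_left(LEXICON, word)
        if i < len(LEXICON) and LEXICON[i] == word:
            out.append(LEXICON[(i + shift) % len(LEXICON)])
        else:
            out.append(word)  # words not in the lexicon pass through unchanged
    return " ".join(out)

print(n_plus_7("the author deforms the text"))  # -> "the text deforms the author"
```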

Importantly, the results of processed information are not necessarily declarative answers for humanists. If we take seriously Stephen Ramsay’s suggestions for algorithmic criticism, then data offers humanists the opportunity to manipulate or algorithmically derive or generate new artifacts, objects, and texts that we also can read and explore.[3] For humanists, the results of information processing are open to the same kinds of hermeneutic exploration and interpretation as the original data.

Data Can Hold Evidentiary Value

As a species of human artifact, as a cultural object, as a kind of text, and as processable information, data is open to a range of hermeneutic tactics for interpretation. In much the same way that encoding a text is an interpretive act, so are creating, manipulating, transferring, exploring, and otherwise making use of data sets.  Therefore, data is an artifact or a text that can hold the same potential evidentiary value as any other kind of artifact. That is, scholars can uncover information, facts, figures, perspectives, meanings, and traces of thoughts and ideas through the analysis, interpretation, exploration, and engagement with data, which in turn can be deployed as evidence to support all manner of claims and arguments. I contend that data is not a kind of evidence; it is a potential source of information that can hold evidentiary value.

Conclusion

Approaching data in this way should feel liberating to humanists. For us, data and the capabilities of processing data are not so much new methodological paradigms as an opportunity to bring the skills we have honed in the close reading of texts and artifacts into service for this new species of text and artifact. Literary scholar Franco Moretti has already asked us to pivot, to begin to engage in distant reading. What should reassure us all is that at the end of the day, any attempt at distant reading results in a new artifact that we can also read closely.

In the end, the kinds of questions humanists ask about texts and artifacts are just as relevant to ask of data. While the new and exciting prospects of processing data offer humanists a range of exciting possibilities for research, humanistic approaches to the textual and artifactual qualities of data also have a considerable amount to offer to the interpretation of data.

Originally published by Trevor Owens on December 15, 2011. Revised March 2012.

  [3] Stephen Ramsay, Reading Machines: Toward an Algorithmic Criticism (Champaign: University of Illinois Press, 2011).

Demystifying Networks, Parts I & II, by Scott B. Weingart

Part 1 of n: An Introduction

This piece builds on a bunch of my recent blog posts that have mentioned networks. Elijah Meeks already has prepared a good introduction to network visualizations on his own blog, so I cover more of the conceptual issues here, hoping to reach people with little-to-no background in networks or math, and specifically to digital humanists interested in applying network analysis to their own work.

Some Warnings

A network is a fantastic tool in the digital humanist’s toolbox—one of many—and it’s no exaggeration to say pretty much any data can be studied via network analysis. With enough stretching and molding, you too could have a network analysis problem! As with many other science-derived methodologies, it’s fairly easy to extend the metaphor of network analysis into any number of domains.

The danger here is two-fold.

  1. When you’re given your first hammer, everything looks like a nail. Networks can be used on any project. Networks should be used on far fewer. Networks in the humanities are experiencing quite the awakening, and this is due in part to easy tools and available datasets that until recently went untapped. There is a lot of low-hanging fruit out there on the networks+humanities tree, and it ought to be plucked by those brave and willing enough to do so. However, that does not give us an excuse to apply networks to everything. This series will talk a little bit about when hammers are useful, and when you really should be reaching for a screwdriver.
  2. Methodology appropriation is dangerous. Even when the people designing a methodology for some specific purpose get it right—and they rarely do—there is often a score of theoretical and philosophical caveats that get lost when the methodology gets translated. In the more frequent case, when those caveats are not known to begin with, “borrowing” the methodology becomes even more dangerous. Ted Underwood blogs a great example of why literary historians ought to skip a major step in Latent Semantic Analysis, because the purpose of the literary historian is so very different from that of the computer scientist who designed the algorithm. This series will attempt to point out some of the theoretical baggage and necessary assumptions of the various network methods it covers.

The Basics

Nothing worth discovering has ever been found in safe waters. Or rather, everything worth discovering in safe waters has already been discovered, so it’s time to shove off into the dangerous waters of methodology appropriation, cognizant of the warnings but not crippled by them.

Anyone with a lot of time and a vicious interest in networks should stop reading right now, and instead pick up copies of Networks, Crowds, and Markets[4] and Networks: An Introduction[5]. The first is a non-mathy introduction to most of the concepts of network analysis, and the second is a more in-depth (and formula-laden) exploration of those concepts. They’re phenomenal, essential, and worth every penny.

Those of you with slightly less time, but somehow enough to read my rambling blog (there are apparently a few of you out there): so good of you to join me. We’ll start with the really basic basics, but stay with me, because by part n of this series, we’ll be going over the really cool stuff only ninjas, Gandhi, and The Rolling Stones have worked on.

Networks

The word “network” originally meant just that: “a net-like arrangement of threads, wires, etc.” It later came to stand for any complex, interlocking system. Stuff and relationships.

A simple network representation from wikipedia.org

Generally, network studies are made under the assumption that neither the stuff nor the relationships are the whole story on their own. If you’re studying something with networks, odds are you’re doing so because you think the objects of your study are interdependent rather than independent. Representing information as a network implicitly suggests not only that connections matter, but that they are required to understand whatever’s going on.

Oh, I should mention that people often use the word “graph” when talking about networks. It’s basically the mathy term for a network, and its definition is a bit more formalized and concrete. Think dots connected with lines.

Because networks are studied by lots of different groups, there are lots of different words for pretty much the same concepts. I’ll explain some of them below.

The Stuff

Stuff (presumably) exists. Eggplants, true love, the Mary Celeste, tall people, and Terry Pratchett’s Thief of Time all fall in that category. Network analysis generally deals with one or a small handful of types of stuff, and then a multitude of examples of that type.

Say the type we’re dealing with is a book. While scholars might argue the exact lines of demarcation separating book from non-book, I think we can all agree that most of the stuff on my bookshelf are, in fact, books. They’re the stuff. There are different examples of books: a quotation dictionary, a Poe collection, and so forth.

I’ll call this assortment of stuff nodes. You’ll also hear them called vertices (mostly from the mathematicians and computer scientists), actors (from the sociologists), agents (from the modelers), or points (not really sure where this one comes from).

The type of stuff corresponds to the type of node. The individual examples are the nodes themselves. All of the nodes are books, and each book is a different node.

Nodes can have attributes. Each node, for example, may include the title, the number of pages, and the year of publication.

A list of nodes could look like this:

| Title                    | # of pages | year of publication |
| ----------------------------------------------------------- |
| Graphs, Maps, and Trees  | 119        | 2005                |
| How The Other Half Lives | 233        | 1890                |
| Modern Epic              | 272        | 1995                |
| Mythology                | 352        | 1942                |
| Macroanalysis            | unknown    | 2011                |

A network of books (nodes) with no relationships (connections)
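As an illustration of nodes with attributes, here is how the book list above might be entered in Python using the networkx library; networkx is my choice for the sketch, not something the analysis prescribes.

```python
import networkx as nx

G = nx.Graph()

# Each book is a node; the title serves as the identifier, the rest are attributes.
books = [
    ("Graphs, Maps, and Trees",  {"pages": 119,  "year": 2005}),
    ("How The Other Half Lives", {"pages": 233,  "year": 1890}),
    ("Modern Epic",              {"pages": 272,  "year": 1995}),
    ("Mythology",                {"pages": 352,  "year": 1942}),
    ("Macroanalysis",            {"pages": None, "year": 2011}),  # page count unknown
]
G.add_nodes_from(books)

print(G.number_of_nodes())   # 5 nodes, no edges yet
print(G.nodes["Mythology"])  # {'pages': 352, 'year': 1942}
```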

We can get a bit more complicated and add more node types to the network. Authors, for example. Now we’ve got a network with books and authors (but nothing linking them, yet!). Franco Moretti and Graphs, Maps, and Trees are both nodes, although they are of different varieties, and not yet connected. We could have a second list of nodes, part of the same network, that might look like this:

| Author          | Birth | Death   |
| --------------------------------- |
| Franco Moretti  | ?     | n/a     |
| Jacob A. Riis   | 1849  | 1914    |
| Edith Hamilton  | 1867  | 1963    |
| Matthew Jockers | ?     | n/a     |

A network of books and authors without relationships.

A network with two types of nodes is called 2-mode, bimodal, or bipartite. We can add more, making it multimodal. Publishers, topics, you-name-it. We can even add seemingly unrelated node-types, like academic conferences, or colors of the rainbow. The list goes on. We would have a new list for each new variety of node.

Presumably we could continue adding nodes and node-types until we run out of stuff in the universe. This would be a bad idea, and not just because it would take more time, energy, and hard-drives than could ever possibly exist. As it stands now, network science is ill-equipped to deal with multimodal networks. 2-mode networks are difficult enough to work with, but once you get to three or more varieties of nodes, most algorithms used in network analysis simply do not work. It’s not that they can’t work; it’s just that most algorithms were only created to deal with networks with one variety of node.

This is a trap I see many newcomers to network science falling into, especially in the digital humanities. They find themselves with a network dataset of, for example, authors and publishers. Each author is connected with one or several publishers (we’ll get into the connections themselves in the next section), and the up-and-coming network scientist loads the network into their favorite software and visualizes it. Woah! A network! Then, because the software is easy to use, and has a lot of buttons with words that from a non-technical standpoint seem to make a lot of sense, they press those buttons to see what comes out. Then, they change the visual characteristics of the network based on the buttons they’ve pressed.

Let’s take a concrete example. Popular network software Gephi comes with a button that measures the centrality of nodes. Centrality is a pretty complicated concept that I’ll get into in more detail later, but for now it’s enough to say that it does exactly what it sounds like: it finds how central, or important, each node is in a network. The newcomer to network analysis loads the author-publisher network into Gephi, finds the centrality of every node, and then makes the nodes with the highest centrality bigger. The issue here is that, although the network loads into Gephi perfectly fine, and although the centrality algorithm runs smoothly, the resulting numbers do not mean what they usually mean. Centrality, as it exists in Gephi, was fine-tuned to be used with single mode networks, whereas the author-publisher network (not to mention the author-book network above) is bimodal. Centrality measures have been made for bimodal networks, but those algorithms are not included with Gephi.

Most computer scientists working with networks do so with only one or a few types of nodes. Humanities scholars, on the other hand, are often dealing with the interactions of many types of things, and so the algorithms developed for traditional network studies are insufficient for the networks we often have. There are ways of fitting their algorithms to our networks, or vice-versa, but that requires fairly robust technical knowledge of the task at hand.

Besides dealing with the single mode / multimodal issue, humanists also must struggle with fitting square pegs in round holes. Humanistic data are almost by definition uncertain, open to interpretation, flexible, and not easily definable. Node types are by definition concrete; your object either is or is not a book. Every book-type thing must share certain unchanging characteristics. This reduction of data comes at a price, one that some argue traditionally divided the humanities and social sciences. If humanists care more about the differences than the regularities, more about what makes an object unique rather than what makes it similar, that is the very information they are likely to lose by defining their objects as nodes.
This is not to say it cannot be done, or even that it has not! People are clever, and network science is more flexible than some give it credit for. The important thing is either to be aware of what you are losing when you reduce your objects to one or a few types of nodes, or to change the methods of network science to fit your more complex data.
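The practical upshot: if you have a bimodal network, either use measures written for bipartite graphs or project the network down to a single mode before running standard algorithms. Here is a hedged sketch of both options using networkx (an assumed tool), with the tiny author-book network from above:

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
authors = ["Franco Moretti", "Jacob A. Riis"]
books = ["Modern Epic", "Graphs, Maps, and Trees", "How The Other Half Lives"]
B.add_nodes_from(authors, kind="author")
B.add_nodes_from(books, kind="book")
B.add_edges_from([("Franco Moretti", "Modern Epic"),
                  ("Franco Moretti", "Graphs, Maps, and Trees"),
                  ("Jacob A. Riis", "How The Other Half Lives")])

# Option 1: a degree centrality written specifically for bipartite networks.
print(bipartite.degree_centrality(B, authors))

# Option 2: project onto one node type, then use ordinary single-mode measures.
author_net = bipartite.projected_graph(B, authors)  # authors linked via shared books
print(nx.degree_centrality(author_net))
```

Neither option is free of interpretive cost; projection, in particular, throws information away, which is exactly the kind of tradeoff worth making consciously rather than by accident of software defaults.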

The Relationships

Relationships (presumably) exist. Friendships, similarities, web links, authorships, and wires all fall into this category. Network analysis generally deals with one or a small handful of types of relationships, and then a multitude of examples of that type. Now that we have stuff and relationships, we’re equipped to represent everything needed for a simple network. Let’s start with a single mode network; that is, a network with only one sort of node: cities. We can create a network of which cities are connected to one another by at least one stretch of highway, like the one below:

| City          | is connected to |
| ------------------------------- |
| Indianapolis  | Louisville      |
| Louisville    | Cincinnati      |
| Cincinnati    | Indianapolis    |
| Cincinnati    | Lexington       |
| Louisville    | Lexington       |
| Louisville    | Nashville       |

Cities interconnected by highways

The simple network above shows how certain cities are connected to one another via highways. A connection via a highway is the type of relationship. An example of one of the above relationships can be stated “Louisville is connected via a highway to Indianapolis.” These connections are symmetric because a connection from Louisville to Indianapolis also implies a connection in the reverse direction, from Indianapolis to Louisville. More on that shortly.

First, let’s go back to the example of books and authors from the last section. Say the type we’re dealing with is an authorship. Books (the stuff) and authors (another kind of stuff) are connected to one another via the authorship relationship, which is formalized in the phrase “X is an author of Y.” The individual relationships themselves are of the form “Franco Moretti is an author of Graphs, Maps, and Trees.”

Much like the stuff (nodes), relationships enjoy a multitude of names. I’ll call them edges. You’ll also hear them called arcs, links, ties, and relations. For simplicity’s sake, although edges are often used to describe only one variety of relationship, I’ll use the term for pretty much everything and just add qualifiers when discussing specific types. The type of relationship corresponds to the type of edge. The individual examples are the edges themselves. Individual edges are defined, in part, by the nodes that they connect. A list of edges could look like this:

| Person                   | Is an author of            |
| ----------------------------------------------------- |
| Franco Moretti           | Modern Epic                |
| Franco Moretti           | Graphs, Maps, and Trees    |
| Jacob A. Riis            | How The Other Half Lives   |
| Edith Hamilton           | Mythology                  |
| Matthew Jockers          | Macroanalysis              |

Network of books, authors, and relationships between them.
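Read into code, the same edge list becomes a graph directly; the sketch below (again Python with networkx, as an assumption) adds each authorship pair as an edge, and the nodes come along for free.

```python
import networkx as nx

authorships = [
    ("Franco Moretti",  "Modern Epic"),
    ("Franco Moretti",  "Graphs, Maps, and Trees"),
    ("Jacob A. Riis",   "How The Other Half Lives"),
    ("Edith Hamilton",  "Mythology"),
    ("Matthew Jockers", "Macroanalysis"),
]

G = nx.Graph()
G.add_edges_from(authorships)  # nodes are created implicitly from the edge list

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")  # 9 nodes, 5 edges
print(list(G.neighbors("Franco Moretti")))  # the two books he authored
```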

Notice how, in this scheme, edges can only link two different types of nodes. That is, a person can be an author of a book, but a book cannot be an author of a book, nor can a person be an author of a person. For a network to be truly bimodal, it must be of this form. Edges can go between types, but not among them. This constraint may seem artificial, and in some sense it is, but for now the short explanation is that it is a constraint required by most algorithms that deal with bimodal networks.

As mentioned above, algorithms are developed for specific purposes. Single mode networks are the ones with the most research done on them, but bimodal networks certainly come in a close second. They are networks with two types of nodes, and edges only going between those types. Contrast this against the single mode city-to-city network from before, where edges connected nodes of the same type.

Of course, the world humanists care to model is often a good deal more complicated than that, and not only does it have multiple varieties of nodes – it also has multiple varieties of edges. Perhaps, in addition to “X is an author of Y” type relationships, we also want to include “A collaborates with B” type relationships. Because edges, like nodes, can have attributes, an edge list combining both might look like this.

| Node 1                   | Node 2                     | Edge Type         |
| ------------------------ | -------------------------- | ----------------- |
| Franco Moretti           | Modern Epic                | is an author of   |
| Franco Moretti           | Graphs, Maps, and Trees    | is an author of   |
| Jacob A. Riis            | How The Other Half Lives   | is an author of   |
| Edith Hamilton           | Mythology                  | is an author of   |
| Matthew Jockers          | Macroanalysis              | is an author of   |
| Matthew Jockers          | Franco Moretti             | collaborates with |

Network of authors, books, authorship relationships, and collaboration relationships.

Notice that there are now two types of edges: “is an author of” and “collaborates with.” Not only are they two different types of edges; they act in two fundamentally different ways. “X is an author of Y” is an asymmetric relationship; that is, you cannot switch out Node1 for Node2. You cannot say “Modern Epic is an author of Franco Moretti.” We call this type of relationship a directed edge, and we generally represent that visually using an arrow going from one node to another.

“A collaborates with B,” on the other hand, is a symmetric relationship. We can switch out “Matthew Jockers collaborates with Franco Moretti” with “Franco Moretti collaborates with Matthew Jockers,” and the information represented would be exactly the same. This is called an undirected edge, and is usually represented visually by a simple line connecting two nodes. Notice that this is an edge connecting two nodes of the same type (an author-to-author connection), and recall that true bimodal networks require edges to only go between types. Algorithms meant for bimodal networks no longer apply to the network above.

Most network algorithms and visualizations break down when combining these two flavors of edges. Some algorithms were designed for directed edges, like Google’s PageRank, whereas other algorithms are designed for undirected edges, like many centrality measures. Combining both types is rarely a good idea. Some algorithms will still run when the two are combined; however, the results usually make little sense.
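A short sketch of the distinction, with networkx standing in as the tool of choice: directed relationships go in a directed graph, undirected ones in an undirected graph, and each gets the algorithms designed for it.

```python
import networkx as nx

# Directed: "is an author of" only runs one way.
authorship = nx.DiGraph()
authorship.add_edge("Franco Moretti", "Modern Epic", relation="is an author of")

# Undirected: "collaborates with" is symmetric.
collaboration = nx.Graph()
collaboration.add_edge("Matthew Jockers", "Franco Moretti", relation="collaborates with")

print(nx.pagerank(authorship))              # PageRank was designed with directed edges in mind
print(nx.degree_centrality(collaboration))  # degree centrality treats each edge symmetrically
```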

Both directed and undirected edges can also be weighted. For example, I can try to make a network of books, with those books that are similar to one another sharing an edge between them. The more similar they are, the heavier the weight of that edge. I can say that every book is similar to every other on a scale from 1 to 100, and compare them by whether they use the same words. Two dictionaries would probably connect to one another with an edge weight of 95 or so, whereas Graphs, Maps, and Trees would probably share an edge of weight 5 with How The Other Half Lives. This is often visually represented by the thickness of the line connecting two nodes, although sometimes it is represented as color or length.
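As a minimal sketch of a weighted edge, the snippet below computes a crude shared-vocabulary score between two placeholder strings and stores it as the edge weight; the overlap measure is only an illustration, a stand-in for whatever similarity metric you actually trust.

```python
import networkx as nx

def overlap_score(text_a, text_b):
    """Crude similarity: percentage of shared unique words, 0 to 100."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return round(100 * len(a & b) / len(a | b))

# Placeholder snippets standing in for the full texts of two books.
book_a = "graphs maps trees abstract models for literary history"
book_b = "maps and models of the literary field"

G = nx.Graph()
G.add_edge("Book A", "Book B", weight=overlap_score(book_a, book_b))

print(G["Book A"]["Book B"]["weight"])  # 25
```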

It’s also worth pointing out the difference between explicit and inferred edges. If we’re talking about computers connected on a network via wires, the edges connecting each computer actually exist. We can weight them by wire length, and that length, too, actually exists. Similarly, citation linkages, neighbor relationships, and phone calls are explicit edges.

We can begin to move into interpretation when we begin creating edges between books based on similarity (even when using something like word comparisons). The edges are a layer of interpretation not intrinsic in the objects themselves. The humanist might argue that all edges are intrinsic all the way down, or inferred all the way up, but in either case there is a difference in kind between two computers connected via wires, and two books connected because we feel they share similar topics.

As such, algorithms made to work on one may not work on the other; or perhaps they may, but their interpretative framework must change drastically. A very central computer might be one in which, if removed, the computers will no longer be able to interact with one another; a very central book may be something else entirely.

As with nodes, edges come with many theoretical shortcomings for the humanist. Really, everything is probably related to everything else in its light cone. If we’ve managed to make everything in the world a node, realistically we’d also have some sort of edge between pretty much everything, with a lesser or greater weight. A network of nodes where almost everything is connected to almost everything else is called dense, and dense networks are rarely useful. Most network algorithms (especially ones that detect communities of nodes) work better and faster when the network is sparse, when most nodes are only connected to a small percentage of other nodes.

Maximally dense networks from sagemath.org

To make our network sparse, we often must artificially cut off which edges to use, especially with humanistic and inferred data. That’s what Shawn Graham showed us how to do when combining topic models with networks. The network was one of authors and topics; which authors wrote about which topics? The data itself connected every author to every topic to a greater or lesser degree, but such a dense network would not be very useful, so Shawn limited the edges to the highest weighted connections between an author and a topic. The resulting network, which he shared as a PDF, was legible, when it otherwise would have looked like a big ball of spaghetti and meatballs.

Unfortunately, given that humanistic data are often uncertain and biased to begin with, every arbitrary act of data-cutting has the potential to add further uncertainty and bias to a point where the network no longer provides meaningful results. The ability to cut away just enough data to make the network manageable, but not enough to lose information, is as much an art as it is a science.
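One common, and admittedly arbitrary, way to sparsify is a simple weight threshold: keep only the heaviest edges. A sketch, assuming a weighted networkx graph like the similarity network described above:

```python
import networkx as nx

def sparsify(G, min_weight):
    """Return a copy of G that keeps only edges at or above min_weight."""
    H = nx.Graph()
    H.add_nodes_from(G.nodes(data=True))
    H.add_edges_from((u, v, d) for u, v, d in G.edges(data=True)
                     if d.get("weight", 0) >= min_weight)
    return H

# A dense little similarity network, then a sparser version of it.
G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 95), ("A", "C", 5),
                           ("B", "C", 40), ("C", "D", 2)])
H = sparsify(G, min_weight=30)
print(list(H.edges(data=True)))  # only the A-B and B-C edges survive
```

Where exactly to set the threshold is precisely the art-versus-science judgment described above; the code only makes the cut explicit and repeatable.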

Hypergraphs & Multigraphs

Mathematicians and computer scientists have actually formalized more complex varieties of networks, and they call them hypergraphs and multigraphs. Because humanities data are often so rich and complex, it may be more appropriate to represent them using these representations. Unfortunately, although ample research has been done on both, most out-of-the-box tools support neither. We have to build them for ourselves.

A hypergraph is one in which more than two nodes can be connected by one edge. A simple example would be an “is a sibling of” relationship, where a single edge connects three sisters rather than two. This is a symmetric, undirected edge, but perhaps there can be directed edges as well, of the type “Alex convinced Betty to run away from Carl.” A three-part edge.

A multigraph is one in which multiple edges can connect any two nodes. We can have, for example, a transportation graph between cities, with an edge for every transportation route. Realistically, many routes can exist between any two cities: some by plane, several different highways, trains, etc.
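Since most off-the-shelf packages will not model these structures directly, a humanist might fall back on plain data structures; the sketch below is one possible representation of my own devising.

# Sketch: plain-Python stand-ins for a hypergraph and a multigraph.

# Hypergraph: each edge is a set of nodes (here, three siblings share one edge).
hyperedges = [
    {"sister_1", "sister_2", "sister_3"},   # undirected "is a sibling of"
]

# A directed three-part edge ("Alex convinced Betty to run away from Carl")
# could be a tuple whose positions carry the roles.
directed_hyperedges = [
    ("Alex", "Betty", "Carl"),  # (convincer, convinced, person left behind)
]

# Multigraph: several distinct edges may connect the same pair of cities.
multi_edges = [
    ("CityA", "CityB", "plane"),
    ("CityA", "CityB", "highway 1"),
    ("CityA", "CityB", "train"),
]

print("{} routes between CityA and CityB".format(
    sum(1 for u, v, _ in multi_edges if {u, v} == {"CityA", "CityB"})))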

I imagine both of these representations will be important for humanists going forward, but rather than relying on that computer scientist who keeps hanging out in the history department, we ourselves will have to develop algorithms that accurately capture exactly what it is we are looking for. We have a different set of problems, and though the solutions may be similar, they must be adapted to our needs.

Side note: RDF Triples

Digital humanities loves RDF (Resource Description Framework), which is essentially a method of storing and embedding structured data. RDF basically works using something called a triple: a subject, a predicate, and an object. “Moretti is an author of Graphs, Maps, and Trees” is an example of a triple, where “Moretti” is the subject, “is an author of” is the predicate, and “Graphs, Maps, and Trees” is the object. As such, nearly all RDF documents can be represented as a directed network. Whether that representation would actually be useful depends on the situation.
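As a rough sketch of that representation (mine, assuming the networkx library), subjects and objects become nodes and predicates become edge labels:

# Sketch: represent RDF-style triples as a directed, labeled network.
# Assumes the networkx library; the triples are illustrative.
import networkx as nx

triples = [
    ("Moretti", "is an author of", "Graphs, Maps, and Trees"),
]

G = nx.DiGraph()
for subject, predicate, obj in triples:
    G.add_edge(subject, obj, predicate=predicate)

for s, o, data in G.edges(data=True):
    print("{} --[{}]--> {}".format(s, data["predicate"], o))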

Side note: Perspectives

Context is key, especially in the humanities. One thing the last few decades have taught us is that perspectives are essential, and any model of humanity that does not take into account its multifaceted nature is doomed to be forever incomplete. According to Alex, his friends Betty and Carl are best friends. According to Carl, he can’t actually stand Betty. The structure and nature of a network might change depending on the perspective of a particular node, and I know of no model that captures this complexity. If you’re familiar with something that might capture this, or are working on it yourself, please let me know via e-mail.

Networks, Revisited

This piece has discussed the simplest units of networks: the stuff and the relationships that connect them. Any network analysis approach must subscribe to and live with that duality of objects. Humanists face problems from the outset: data that do not fit neatly into one category or the other, complex situations that ought not be reduced, and methods that were developed with different purposes in mind. However, network analysis remains a viable methodology for answering and raising humanistic questions—we simply must be cautious, and must be willing to get our hands dirty editing the algorithms to suit our needs.

Part II: Node Degree: An Introduction

In Part II, I will cover the deceptively simple concept of node degree. I say “deceptive” because, on the one hand, network degree can tell you quite a lot. On the other hand, degree can often lead one astray, especially as networks become larger and more complicated.

A node’s degree is, simply, how many edges it is connected to. Generally, this also correlates to how many neighbors a node has, where a node’s neighborhood is those other nodes connected directly to it by an edge. In the network below, each node is labeled by its degree.

Each node in the network is labeled with its degree, from wikipedia.org

If you take a minute to study the network, something might strike you as odd. The bottom-right node, with degree 5, is connected to only four distinct edges, and really only three other nodes (four, including itself). Self-loops, which will be discussed later, are counted twice. A self-loop is any edge which starts and ends at the same node.

Why are self-loops counted twice? As a rule of thumb, the degree counts the number of times a node is connected to an edge, and a self-loop connects to its node twice, once at each end. There are some more math-y reasons dealing with matrix representation, another topic for a later date. Suffice it to say that many network algorithms will not work well if self-loops are only counted once.

The odd node out on the bottom left, with degree zero, is called an isolate. An isolate is any node with no edges.

At any rate, the concept is clearly simple enough. Count the number of times a node is connected to an edge, get the degree. If only getting higher education degrees were this easy.
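For those following along in code, a minimal sketch with the networkx library (an assumption on my part) shows both the self-loop rule and an isolate:

# Sketch: node degree, self-loops, and isolates, assuming the networkx library.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "C")])  # C has a self-loop
G.add_node("D")                                          # D is an isolate

print(dict(G.degree()))      # {'A': 1, 'B': 2, 'C': 3, 'D': 0} -- the self-loop counts twice
print(list(nx.isolates(G)))  # ['D']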

Centrality

Node degree is occasionally called degree centrality. Centrality is generally used to determine how important nodes are in a network, and lots of clever researchers have come up with lots of clever ways to measure it. “Importance” can mean a lot of things. In social networks, centrality can be the amount of influence or power someone has; in the U.S. electrical grid network, centrality might mean which power station should be removed to cause the most damage to the network.

The simplest way of measuring node importance is to just look at its degree. This centrality measurement at once seems deeply intuitive and extremely silly. If we’re looking at the social network of Facebook, with every person a node connected by an edge to their friends, it’s no surprise that the most well-connected person is probably also the most powerful and influential in the social space. By the same token, though, degree centrality is such a coarse-grained measurement that it’s really anybody’s guess what exactly it’s measuring. It could mean someone has a lot of power; it could also mean that someone tried to become friends with absolutely everybody on Facebook. Recall the example of a city-to-city network from Part I of this series: Louisville was the most central city because you have to drive through it to get to the most other cities.

Degree Centrality Sampling Warnings

Degree works best as a measure of network centrality when you have full knowledge of the network. That is, a social network exists, and instead of getting some glimpse of it and analyzing just that, you have the entire context of the social network: all the friends, all the friends of friends, and so forth.

When you have an ego-network (a network of one person, like a list of all my friends and who among them are friends with one another), clearly the node with the highest centrality is the ego node itself. This knowledge tells you very little about whether that ego is actually central within the larger network, because you sampled the network such that the ego is necessarily the most central. Sampling strategies—how you pick which nodes and edges to collect—can fundamentally affect centrality scores. The city-to-city network from Part I has Louisville as the most central city; however, a simple look at a map of the United States would show that, given more data, this would no longer be the case.

An ego network from wikipedia.org

A historian of science might generate a correspondence network from early modern letters currently held in Oxford’s library. In fact, this is currently happening, and the resulting resource will be invaluable. Unfortunately, centrality scores generated from nodes in that early modern letter-writing network will reflect the whims of Oxford editors and collectors over the years more accurately than the underlying correspondence network itself. Oxford scholars over the years selected certain collections of letters, be they from Great People or sent to or from Oxford, and that choice of what to hold at Oxford libraries will bias centrality scores toward Oxford-based scholars, Great People, and whatever else was selected for.

Similarly, a social network generated from a literary work will be biased toward the recurring characters; characters that occur more frequently are simply statistically more likely to appear with more people, and as such will have the highest degrees. It is likely that degree centrality and frequency of character occurrence are almost exactly correlated.

Of course, if what you’re looking for is the most central character in the novel or the most central figure from Oxford’s perspective, this measurement might be perfectly sufficient. The important thing is to be aware of the limitations of degree centrality, and the possible biasing effects from selection and sampling. Once those biases are explicit, careful and useful inferences can still be drawn.

Things get a bit more complicated when looking at document similarity networks. If you’ve got a network of books with edges connecting them based on whether they share similar topics or keywords, your degree centrality score will mean something very different. In this case, centrality could mean the most general book. Keep in mind that book length might affect these measurements as well; the longer a book is, the more likely (by chance alone) it will cover more topics. Thus, longer books may also appear to be more central, if one is not careful in generating the network.

Degree Centrality in Bimodal Networks

Recall that bimodal networks are ones where there are two different types of nodes (e.g., articles and authors), and edges are relationships that bridge those types (e.g., authorships). In this example, the more articles an author has published, the more central she is. Degree centrality would have nothing to do, in this case, with the number of co-authorships, the position in the social network, etc.
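A short sketch (with invented authors and articles, and assuming the networkx library) of what degree means in such a bimodal network:

# Sketch: degree in a bimodal (author/article) network.
# Assumes the networkx library; the authorship data is invented.
import networkx as nx

B = nx.Graph()
authorships = [
    ("Author X", "Article 1"),
    ("Author X", "Article 2"),
    ("Author X", "Article 3"),
    ("Author Y", "Article 3"),
]
B.add_edges_from(authorships)

# An author's degree here is simply how many articles she has written,
# regardless of co-authorship or position in any social network.
print("Author X degree: {}".format(B.degree("Author X")))  # 3
print("Author Y degree: {}".format(B.degree("Author Y")))  # 1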

With an even more multimodal network, having many types of nodes, degree centrality becomes even less well defined. As the sorts of things a node can connect to increases, the utility of simply counting the number of connections a node has decreases.

Micro vs. Macro

Looking at the degree of an individual node, and comparing it against others in the network, is useful for finding out about the relative position of that node within the network. Looking at the degree of every node at once turns out to be exceptionally useful for talking about the network as a whole, and comparing it to others. I’ll leave a thorough discussion of degree distributions for a later post, but it’s worth mentioning them in brief here. The degree distribution shows how many nodes have how many edges.

As it happens, many real world networks exhibit something called “power-law properties” in their degree distributions. What this essentially means is that a small number of nodes have an exceptionally high degree, whereas most nodes have very low degrees. By comparing the degree distributions of two networks, it is possible to say whether they are structurally similar. There’s been some fantastic work comparing the degree distribution of social networks in various plays and novels to find if they are written or structured similarly.
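For the curious, here is one way a degree distribution might be tabulated in code; the networkx library and the placeholder random graph are assumptions for illustration only.

# Sketch: tabulate how many nodes have each degree.
# Assumes the networkx library; the random graph is just a stand-in for real data.
from collections import Counter
import networkx as nx

G = nx.erdos_renyi_graph(n=100, p=0.05, seed=42)  # placeholder network

degree_distribution = Counter(d for _, d in G.degree())
for degree, count in sorted(degree_distribution.items()):
    print("{} nodes have degree {}".format(count, degree))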

Extending Degree

For the entirety of this piece, I have been talking about networks that were unweighted and undirected. Every edge counted just as much as every other, and they were all symmetric (a connection from A to B implies the same connection from B to A). Degree can be extended to both weighted and directed (asymmetric) networks with relative ease.

Combining degree with edge weights is often called strength. The strength of a node is the sum of the weights of its edges. For example, let’s say Steve is part of a weighted social network. The first time he interacts with someone, an edge is created to connect the two with a weight of 1. Every subsequent interaction incrementally increases the weight by 1, so if he’s interacted with Sally four times, Samantha two times, and Salvador six times, the edge weights between them are 4, 2, and 6 respectively.

In the above example, because Steve is connected to three people, his degree is 1+1+1=3. Because he is connected to one of them four times, another twice, and another six times, his strength is 4+2+6=12.

Combining degree with directed edges is also quite simple. Instead of one degree score, every node now has two different degrees: in-degree and out-degree. The in-degree is the number of edges pointing to a node, and the out-degree is the number of edges pointing away from it. If Steve borrowed money from Sally, and lent money to Samantha and Salvador, his in-degree would be 1 and his out-degree 2.
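Both extensions are easy to compute once the edges carry weights or directions; here is a hedged sketch using the Steve example, again assuming the networkx library.

# Sketch: strength (weighted degree) and in-/out-degree, assuming the networkx library.
import networkx as nx

# Weighted, undirected interactions.
W = nx.Graph()
W.add_edge("Steve", "Sally", weight=4)
W.add_edge("Steve", "Samantha", weight=2)
W.add_edge("Steve", "Salvador", weight=6)
print("degree: {}".format(W.degree("Steve")))                     # 3
print("strength: {}".format(W.degree("Steve", weight="weight")))  # 12

# Directed lending network.
D = nx.DiGraph()
D.add_edge("Sally", "Steve")     # Steve borrowed from Sally
D.add_edge("Steve", "Samantha")  # Steve lent to Samantha
D.add_edge("Steve", "Salvador")  # Steve lent to Salvador
print("in-degree: {}, out-degree: {}".format(D.in_degree("Steve"), D.out_degree("Steve")))  # 1, 2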

Powerful Degrees

The degree of a node is really very simple: more connections, higher degree. However, this simple metric accounts for a great deal in network science. Many algorithms that analyze both node-level properties and network-level properties are closely correlated with degree and degree distribution. This is a Pareto-like effect; a great deal about a network is driven by the degree of its nodes.

While degree-based results are often intuitive, it is worth pointing out that the prime importance of degree is a direct result of the binary network representation of nodes and edges. Interactions either happen or they don’t, and everything that exists is a self-contained node or edge. Thus, how many nodes, how many edges, and which nodes have which edges will be the driving force of any network analysis. This is both a limitation and a strength; basic counts influence so much, yet they are apparently powerful enough to yield intuitive, interesting, and ultimately useful results.

Originally published by Scott Weingart on December 14, 2011 and December 17, 2011. Revised March 2012.


I plan to continue blogging about network analysis, so if you have any requests, please feel free to get in touch with me at scbweing at indiana dot edu.


Clustering with Compression for the Historian, by Chad Black

INTRODUCTION

I mentioned in my blog that I’m playing around with a variety of clustering techniques to identify patterns in legal records from the early modern Spanish Empire. In this post, I will discuss the first of my training experiments using Normalized Compression Distance (NCD). I’ll look at what NCD is, some potential problems with the method, and then the results from using NCD to analyze the Criminales Series descriptions of the Archivo Nacional del Ecuador’s (ANE) Series Guide. For what it’s worth, this is a very easy and approachable method for measuring similarity between documents and requires almost no programming chops. So, it’s perfect for me!

WHAT IS NCD?

I was inspired to look at NCD for clustering by a pair of posts by Bill Turkel (here and here) from quite a few years ago. Bill and Stéfan Sinclair also used NCD to cluster cases for the Digging Into Data Old Bailey Project. Turkel’s posts provide a nice overview of the method, which was proposed in 2005 by Rudi Cilibrasi and Paul Vitányi.[6] Essentially, Cilibrasi and Vitányi proposed measuring the distance between two strings of arbitrary length by comparing the sum of the lengths of the individually compressed files to a compressed concatenation of the two files. So, adding the compressed length of x to the compressed length of y will be longer than the compressed length of (x|y). How much longer is what is important. The formula is this, where C(x) is the length of x compressed:

NCD(x,y) = [C(x|y) - min{C(x), C(y)}] / max{C(x), C(y)}

C(x|y) is the compression of the concatenated strings. Theoretically, if you concatenated and compressed two identical strings, you would get a distance of 0, because [C(x|x) - C(x)] / C(x) would equal 0. As we’ll see in a bit, though, this isn’t the case: the overhead required by the various compression algorithms at our disposal makes a 0 impossible, and more so for long strings, depending on the method. Cilibrasi and Vitányi note that in practice, if r is the NCD, then 0 ≤ r ≤ 1 + ∊, where ∊ is usually around 0.1 and accounts for the implementation details of the compression algorithm. Suffice it to say, though, that the closer to 0 the result is, the more similar the strings (or files, in our case) are. Nonetheless, the distance between two strings, or files, or objects as measured with this formula can then be used to cluster those strings, files, or objects. One obvious advantage to the method is that it works for comparing strings of arbitrary length with one another.
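To make the formula concrete, here is a minimal sketch of NCD on two byte strings using Python’s built-in zlib; the sample strings are invented, and the actual experiments below work on whole files with several compressors.

# Sketch: Normalized Compression Distance between two byte strings using zlib.
import zlib

def ncd(x, y):
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / float(max(cx, cy))

a = b"Querella criminal iniciada por dona Joana Requejo" * 20
b_ = b"Querella criminal iniciada por dona Joana Requejo" * 20
c = b"Completely unrelated text about compression overhead" * 20
print("identical: {:.3f}".format(ncd(a, b_)))  # close to 0, but not 0, due to overhead
print("different: {:.3f}".format(ncd(a, c)))   # closer to 1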

Why does this work? Essentially, lossless compression suppresses redundancy in a string, while maintaining the ability to fully restore the file. Compression algorithms evolved to deal with constraints in the storage and transmission of data. It’s easy to forget in the age of the inexpensive terabyte hard drive what persistent storage once cost. In 1994, the year that the first edition of Witten, Moffat, and Bell’s Managing Gigabytes was published, hard disk storage still ran at close to $1/megabyte. That’s right, just 17 years ago that 500GB drive in your laptop would have cost $500,000. To put that into perspective, in 1980 IBM produced one of the first disk drives to break the GB barrier. The 2.52GB IBM 3380 was initially released in 5 different models, and ranged in price between $81,000 and $142,000. For what it’s worth, the median housing price in Washington, DC in 1980 was the second highest in the country at $62,000. A hard disk that cost twice as much as the median house in DC. Obviously not a consumer product. At the per-GB rate that the 3380 sold for, your 500GB drive would have cost up to $28,174,603.17! In inflation-adjusted dollars for 2011 that would be $80.5M! An absurd comparison, to be sure. Given those constraints, efficiency in data compression made real financial sense. Even now, despite the plunging costs of storage and growing bandwidth capacity, text and image compression remains an imperative in computer science.

As Witten, et al. define it,

Text compression … involves changing the representation of a file so that it takes less space to store or less time to transmit, yet the original file can be reconstructed exactly from the compressed representation.[7]

This is lossless compression (as opposed to lossy compression, which you may know from messing with jpegs or other image formats). There are a variety of compression methods, each of which takes a different approach to compressing text data, and which lie, either individually or in some kind of combination, behind the compression formats you’re used to: .zip, .bz2, .rar, .gz, etc. Frequently, they also have their roots in the early days of electronic data. Huffman coding, for example, was developed by an eponymous MIT graduate student in the early 1950s.

In any case, the objective of a compression method is to locate, remove, store, and recover redundancies within a text. NCD works because within a particular algorithm, the compression method is consistently imposed on the data, thus making the output comparable. What isn’t comparable, though, is mixing algorithms.

LIMITATIONS: SIZE MATTERS

Without getting too technical (mostly because I get lost once it goes too far), it’s worth noting some limitations based on which method of compression you choose when applying NCD. Shortly after Cilibrasi and Vitányi published their paper on clustering via compression, Cebrián, et al. published a piece that compared the integrity of NCD across three compressors: bzip2, gzip, and PPMZ.[8] The paper is interesting, in part, because the authors do an excellent job of explaining the mechanics of the various compressors in language that even I could understand.

I came across this paper through some google-fu because I was confused by the initial results I was getting while playing around with my Criminales Series Guide. Python has built-in support for compression and decompression using bzip2 and gzip, so that’s what I was using. I have the Criminales Series divided into decades from 1601 to 1830. My script was walking through and comparing every file in the directory to every other one, including itself. I assumed that the concatenation of two files that were identical would produce a distance measurement of 0, and was surprised to see that it wasn’t happening, and in some cases not even close. (I also hadn’t read much of anything about compression at that point!) But that wasn’t the most surprising thing. What was more surprising was that in the latter decades of my corpus, the distance measures when comparing individual decades to themselves were actually coming out very high. Or, at least they were using the gzip algorithm. For example, the decade with the largest number of cases, and thus the longest text, is 1781-1790 at about 39,000 words. Gzip returned an NCD of 0.97458 when comparing this decade to itself. What? How is that possible?

Cebrián, et al. explain how different compression methods have upper limits to the size of a block of text that they operate on before needing to break that block into new blocks. This makes little difference from the perspective of compressors doing their job, but it does have implications for clustering. The article goes into more detail, but here’s a quick and dirty overview.

bzip2

The bzip2 compressor works in three stages to compress a string: (1) a Burrows-Wheeler Transform, (2) a move-to-front transform, and (3) a statistical compressor like Huffman coding.[9] The bzip2 algorithm can perform this method on blocks of text up to 900KB without needing to break the block of text into two blocks. So, for NCD purposes, this means that if a pair of files are concatenated, and the size of this pair is less than 900KB, what the bzip compressor will see is essentially a mirrored text. But, if the concatenated file is larger than 900KB, then bzip will break the concatenation into more than one block, each of which will be sent through the three stages of compression. But, these blocks will no longer be mirrors. As a result, the NCD will cease to be robust. Cebrián, et al. claim that the NCD for C(x|x) should fall in a range between 0.2 and 0.3, and anything beyond that indicates it’s not a good choice for comparing the set of documents under evaluation.

gzip

The gzip compressor uses a different method than bzip2’s block compression, one based on the Lempel-Ziv LZ77 algorithm, also known as sliding window compression. Gzip then takes the LZ77-processed string and subjects it to a statistical encoding like Huffman. It’s the first step that is important for us, though. Sliding window compression searches for redundancies by taking 32KB blocks of data, and looking ahead at the next 32KB of data. The method is much faster than bzip2’s block method. (In my experiments using python’s zlib module, code execution took about 1/2 the time as python’s bzip on default settings.) And, if the text is small, such that C(x|x) < 32KB, the NCD result is better. Cebrián, et al. find that gzip returns an NCD result in the range between 0 and 0.1. But, beyond 32KB they find that NCD rapidly grows beyond 0.9 — exactly what I saw with the large 1781-1790 file (which is 231KB).

lzma

Cebrián, et al. offer a third compressor, ppmz, as an alternative to bzip2 and gzip for files that outsize gzip and bzip2’s upper limits. Ppmz uses Prediction by Partial Match for compression, and has no upper limit on effective file size. PPM is a statistical model that uses arithmetic coding. This gets us to things I don’t really understand, and certainly can’t explain here. Suffice to say that the authors found using ppmz that C(x|x) always returned an NCD value between 0 and 0.1043. I looked around for quite a while and couldn’t find a python implementation of ppmz, but I did find another method ported to python with lzma, the compressor behind 7zip. Lzma uses a different implementation of Lempel-Ziv, utilizing a dictionary instead of a sliding window to track redundancies. What is more, the compression dictionary can be as large as 4GB. You’d need a really, really large document to brush up against that. Though Cebrián, et al. didn’t test lzma, my experiments show the NCD of C(x|x) to be between 0.002 and 0.02! That’s awfully close to 0, and the smallest return actually came from the longest document –> 1781-1790.
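Before trusting any clusters, a cheap sanity check along the lines Cebrián, et al. suggest is to compute the NCD of a file against itself with each candidate compressor; here is a hedged sketch, where the file name is only an assumed example.

# Sketch: check how close NCD(x, x) gets to zero for each available compressor.
# The file name is an assumed example; bz2 and zlib are standard-library modules,
# and lzma here means whichever lzma module is installed.
import bz2
import zlib
import lzma

def self_ncd(data, compress):
    cx = len(compress(data))
    cxx = len(compress(data + data))
    return (cxx - cx) / float(cx)   # NCD(x, x) = [C(x|x) - C(x)] / C(x)

data = open('cr1781_1790.txt', 'rb').read()
for name, compress in [('zlib', lambda d: zlib.compress(d, 9)),
                       ('bzip2', lambda d: bz2.compress(d, 9)),
                       ('lzma', lzma.compress)]:
    print("{}: {:.4f}".format(name, self_ncd(data, compress)))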

THE CODE

In a way, that previous section is getting ahead of myself. I started with just zlib, and then added bzip2 and gzip, and eventually lzma for comparison’s sake. Let me clarify that just a bit. In python, there are two modules that use the gzip compressor:

  1. gzip, which is for file compression/decompression; and
  2. zlib, which is for compressing/decompressing strings or objects.

I was unsettled by my early zlib returns, and tried using gzip and file I/O, but got the same returns. Initially I was interested in speed, but reading Cebrián, et al. changed my mind on that. Nonetheless, I did time the functions to see which was fastest.

I based the script on Bill Turkel’s from back in 2007. (Bill put all of the scripts from the days of Digital History Hacks on Github. Thanks to him for doing that!)

So, for each compressor we need a function to perform the NCD algorithm on a pair of files:

import lzma  # the lzma module used below; bz2 and zlib are imported similarly for the other variants

# Function to calculate the NCD of two files using lzma
def ncd_lzma(filex, filey):
    # read both files as raw bytes so the compressor sees exactly what is on disk
    xbytes = open(filex, 'rb').read()
    ybytes = open(filey, 'rb').read()
    xybytes = xbytes + ybytes
    # compressed versions of x, y, and the concatenation x|y
    cx = lzma.compress(xbytes)
    cy = lzma.compress(ybytes)
    cxy = lzma.compress(xybytes)
    # NCD(x,y) = [C(x|y) - min{C(x),C(y)}] / max{C(x),C(y)}
    if len(cy) > len(cx):
        n = (len(cxy) - len(cx)) / float(len(cy))
    else:
        n = (len(cxy) - len(cy)) / float(len(cx))
    return n

There are small changes depending on the API of the compressor module, but this pretty much sums it up.

We need to be able to list all the files in our target directory, but ignore any dot-files like .DS_Store that creep in on OS X or source control files if you’re managing your docs with git or svn or something:

import os

# list directory ignoring dot files (.DS_Store, .git, .svn, and the like)
def mylistdir(directory):
    filelist = os.listdir(directory)
    return [x for x in filelist
            if not (x.startswith('.'))]

Just as an aside here, let me encourage you to put your files under source control, especially as you can accidentally damage them while developing your scripts.

We need a function to walk that list of files, and perform NCD on every possible pairing, the results of which are written to a file. For this function, we pass as arguments the file list, the results file, and the compressor function of choice:

# pathstring is the directory holding the decade files, defined earlier in the script
def walkFileList(filelist, outfile, compType):
    for i in range(len(filelist)):          # every file, including the last, gets a turn as x
        print i
        for j in filelist:                  # pair it with every file, itself included
            fx = pathstring+str(filelist[i])
            fy = pathstring+str(j)
            outx = str(filelist[i])
            outy = str(j)
            # strip the .txt extension from each name, then write "x  y  NCD" per line
            outfile.write(str(outx[:-4]+"  "+outy[:-4]+"  ")+str(compType(fx, fy))+"\n")

That’s all you need. I mentioned also that I wanted to compare execution time for the different compressors. That’s easy to do with a module from the python standard library called profile, which can return a bunch of information gathered from the execution of your script at runtime. To call a function with profile you simply pass the function to profile.run as a string. So, to perform NCD via lzma as described above, you just need something like this:

import profile

# filelist = mylistdir(pathstring) has already been built earlier in the script
outfile = open('_lzma-ncd.txt', 'w')
print "Starting lzma NCD."
profile.run('walkFileList(filelist, outfile, ncd_lzma)')
print 'lzma finished.'
outfile.close()

I put the print statements in just for shits and giggles. Because we ran this through profile, after doing the NCD analysis and writing it to a file named _lzma-ncd.txt, python reports on the total number of function calls, the time per call, per function, and cumulative for the script. It’s useful for identifying bottlenecks in your code if you get to the point of optimizing. At any rate, there is no question that lzma is much slower than the others, but if you have the cpu cycles available, it may be worth the wait from a quality-of-data perspective. Here’s what profile tells us for the various methods:

  • zlib: 7222 function calls in 16.564 CPU seconds (compressing string objects)
  • gzip: 69460 function calls in 18.377 CPU seconds (compressing file objects)
  • bzip: 7222 function calls in 21.129 CPU seconds
  • lzma: 7222 function calls in 115.678 CPU seconds

If you expected zlib/gzip to be substantially faster than bzip, it was, until I set all of the algorithms to the highest available level of compression. I’m not sure whether that’s necessary, but it does affect the results as well as the running time. Note too that the gzip file method requires many more function calls, but with relatively little performance penalty.

COMPARING RESULTS

The Series Guide

A little bit more about the documents I’m trying to cluster. Around 2002, the Archivo Nacional del Ecuador began to produce PDFs of its ever-growing list of Series finding guides. The Criminales Series Guide (big pdf) was a large endeavor. The staff went through every folder in every box in the series, reorganized them, and wrote descriptions for the Series Guide. Entries in the guide are divided by box and folder (caja/expediente). A typical folder description looks like this:

Expediente: 6
Lugar: Quito
Fecha: 30 de junio de 1636
No. de folios : 5
Contenido: Querella criminal iniciada por doña Joana Requejo, mujer legítima del escribano mayor Andrés de Sevilla contra Pedro Serrano, por haber entrado a su casa y por las amenazas que profirió contra ella con el pretexto de que escondía a una persona que él buscaba.

We have the place (Quito), the date (06/30/1636), the number of pages (5), and a description. The simple description includes the name of the plaintiff, in this case Joana Requejo, and the defendant, Pedro Serrano, along with the central accusation– that Serrano had entered her house and threatened her under the pretext that she was hiding a person he was looking for. There is a wealth of information that can be extracted from that text. The Criminales Series Guide as a whole is big, constituting close to 875 pages of text and some 1.1M words. I currently have text files for the following Series Guides–> Criminales, Diezmos, Encomiendas, Esclavos, Estancos, Gobierno, Haciendas, Indígenas, Matrimoniales, Minas, Obrajes, and Oficios totaling 4.8M words. I’ll do some comparisons between the guides in the near future, and see if we can identify patterns across Series. For now, though, it’s just the Criminales striking my fancy.

The Eighteenth Century

So, what does the script give us for the 18th century? Below are the NCD results for three different compressors comparing my decade of interest, 1781-1790, with the other decades of the 18th century:

zlib:

cr1781_1790  cr1701_1710  0.982798401771
cr1781_1790  cr1711_1720  0.987881971149
cr1781_1790  cr1721_1730  0.977414695455
cr1781_1790  cr1731_1740  0.97668311167
cr1781_1790  cr1741_1750  0.975895252209
cr1781_1790  cr1751_1760  0.975088634189
cr1781_1790  cr1761_1770  0.975632632389
cr1781_1790  cr1771_1780  0.973381605357
cr1781_1790  cr1781_1790  0.974582153107
cr1781_1790  cr1791_1800  0.972256091842
cr1781_1790  cr1801_1810  0.973325329682

bzip:

cr1781_1790  cr1701_1710  0.954733848029
cr1781_1790  cr1711_1720  0.96900988758
cr1781_1790  cr1721_1730  0.929649194095
cr1781_1790  cr1731_1740  0.923066504131
cr1781_1790  cr1741_1750  0.906271163484
cr1781_1790  cr1751_1760  0.903237166463
cr1781_1790  cr1761_1770  0.902912095354
cr1781_1790  cr1771_1780  0.849356630096
cr1781_1790  cr1781_1790  0.287823378031
cr1781_1790  cr1791_1800  0.850331843424
cr1781_1790  cr1801_1810  0.850358932683

lzma:

cr1781_1790  cr1701_1710  0.965529663402
cr1781_1790  cr1711_1720  0.976516942474
cr1781_1790  cr1721_1730  0.947607790161
cr1781_1790  cr1731_1740  0.94510863447
cr1781_1790  cr1741_1750  0.931757289204
cr1781_1790  cr1751_1760  0.931757289204
cr1781_1790  cr1761_1770  0.92759202972
cr1781_1790  cr1771_1780  0.885106382979
cr1781_1790  cr1781_1790  0.0021839468648
cr1781_1790  cr1791_1800  0.880670944501
cr1781_1790  cr1801_1810  0.887110210514

First off, even just eyeballing it, you can see that the results from bzip and lzma are more reliable and follow exactly the patterns discussed by Cebrián, et al. The bzip run provides a C(x|x) of 0.288, which falls in the acceptable range. The lzma run returns a C(x|x) NCD of 0.0022; not much more needs to be said there. And, as I noted above, with zlib/gzip we get 0.9745. Further, by eyeballing the results on the good runs, two relative clusters appear in the decades surrounding 1781-1790. It appears that from 1771 to 1810 we have more similarity than in the earlier decades of the century. This accords with my expectations based on other research, and in both cases the further back from 1781 you go, the more different the decades are on a trendline.

If we change the comparison node to, say, 1741-1750 we get the following results:

bzip:

cr1741_1750  cr1701_1710  0.888048411498
cr1741_1750  cr1711_1720  0.919398218188
cr1741_1750  cr1721_1730  0.826189275508
cr1741_1750  cr1731_1740  0.80795091612
cr1741_1750  cr1741_1750  0.277693730039
cr1741_1750  cr1751_1760  0.785168132862
cr1741_1750  cr1761_1770  0.803655071796
cr1741_1750  cr1771_1780  0.879983993015
cr1741_1750  cr1781_1790  0.906271163484
cr1741_1750  cr1791_1800  0.883904391852
cr1741_1750  cr1801_1810  0.886378259718

lzma:

cr1741_1750  cr1701_1710  0.905551014342
cr1741_1750  cr1711_1720  0.932600133759
cr1741_1750  cr1721_1730  0.862079215278
cr1741_1750  cr1731_1740  0.848926209408
cr1741_1750  cr1741_1750  0.00587055064279
cr1741_1750  cr1751_1760  0.830746598014
cr1741_1750  cr1761_1770  0.844162055066
cr1741_1750  cr1771_1780  0.90796460177
cr1741_1750  cr1781_1790  0.929573342339
cr1741_1750  cr1791_1800  0.908149721264
cr1741_1750  cr1801_1810  0.913968518045

Again, the C(x|x) values show reliable data. But, this time bzip’s similarities look a fair amount different than lzma’s when eyeballing it. I’m interested in the decade of the 1740s in part because I expect more similarity to the latter decades than for other decades in, really, either the 18th or the 17th century. I expect this for reasons that have to do with other types of hermeneutical screwing around, to use Stephen Ramsay’s excellent phrase [PDF], that I’ve been doing with the records lately. Chief among those (and an argument for close as well as distant readings) is that I’ve been transcribing weekly jail censuses from the 1740s this past week, and some patterns of familiarity have been jumping out at me. I have weekly jail counts from 1732 to 1791 inclusive, and a bunch of others too. I’ve transcribed so many of these things that I have pattern expectations. And, the 1740s has jumped out at me for three reasons this week. The first is that in 1741, after a decade of rarely noting it, the notaries started to record the reason for one’s detention. The second is that in 1742, and particularly under the aegis of one particular magistrate, more people started to get arrested than in previous and subsequent decades. The third is that, like in the period between 1760 and 1790, those arrests were increasingly for moral offenses or for being picked up during nightly rounds of the city (the ronda). The differences are these: in the latter period women and men were arrested in almost equal numbers, whereas there are almost no women detainees in the 1740s. And, there doesn’t seem to be an equal growth in both detentions and prosecutions in the 1740s. This makes the decade more like the 1760s than the 1780s. The results above bear that out to some extent, as the distance measures show the 1740s to be more like the 1760s than the 1780s.

I also had this suspicion because a few months ago I plotted occurrences of the terms concubinato (illicit co-habitation) and muerte (used in murder descriptions) from the Guide:

Occurrences of the terms "concubinato" and "muerte" from the Criminales Series Guide.

You should see that right at the decade of the 1740s there is a discernible, if smaller, bump for concubinato. I was reminded of this when transcribing the records.

CONCLUSION

OK, at this point, this post is probably long enough. What’s missing above is obviously visualizations of the clusters. Those visualizations are pretty interesting. For now, though, let me conclude by saying that I am impressed by the clusters that emerged from this simple, if profound, technique. Given that the distinctions I’m trying to pick up are slight, I’m worried a bit about the level of precision I can expect. But, I am convinced that it’s worth sacrificing performance for either the bzip or lzma implementation, depending on the length of one’s documents. Unless your files are longer than 900KB, it’s probably worth just sticking with bzip.

Originally published by Chad Black on October 9, 2011. Revised March 2012.

  1. [6] Rudi Cilibrasi and Paul Vitányi, “Clustering by Compression,” IEEE Transactions on Information Theory 51.4 (2005): 1523-45, PDF.
  2. [7] Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd Edition (San Diego, California: Academic Press, 1999), 21.
  3. [8] Manuel Cebrián, Manuel Alfonseca, and Alfonso Ortega, “Common Pitfalls Using the Normalized Compression Distance: What to Watch Out for in A Compressor,” Communications of Information and Systems 5, no. 4 (2005): 367-384.
  4. [9] Cebrián, et al., “Common Pitfalls Using the Normalized Compression Distance,” 372.

Spatializing Photographic Archives, by Marc Downie & Paul Kaiser

The extensive and carefully illustrated White Paper for our NEH-sponsored “Spatializing Photographic Archives” project can be downloaded as a large PDF (26.5mb).

The White Paper describes the open-source software tool we’ve developed, and our reasons for wanting to forge a new approach to making digital tools for scholars. It also examines the implications of our approach for photography. After examining the history of landscape photography in the American West, we show how by stepping outside the photographic frame and unfreezing a photograph’s frozen instant, we can reveal many hidden aspects of photography and create new kinds of works.

Our first case study investigates Richard Misrach’s canonical Desert Cantos series, which proved to be a difficult but exceptionally rewarding test case. In October 2009, we worked with Misrach at two of the original sites for the Desert Cantos.

Reconstruction of the ruins at Bombay Beach.

At the first site, we reconstructed the ruins at the once-flooded edge of Bombay Beach on the Salton Sea in southern California, where there remained enough landmarks for us to match our spatial reconstruction of the site to Misrach’s original photos.

Spatializing palm trees from Richard Misrach’s "Desert Cantos" series.

At the second site we spatialized a stand of palm trees that was the subject of several of his Desert Fires photographs.

Spatializing Misrach’s photographs of a bulldozer near Bombay Beach.

We also reconstructed the process of one of Misrach’s works in progress, spatializing his attempts to photograph a decrepit bulldozer at the edge of the Salton Sea. We track his path over the time of his shoot and his framing of the subject.

Spatialization of boats approaching the shoreline of Okinawa, Japan in 1945.

The second case study examines battlefield photographs of Okinawa, 1945; the third prototypes a simple pipeline for scholars by which they make a 3D capture of an object using just the video capabilities of a smartphone and a laptop computer.

Finally, the paper presents two hypothetical projects that our approach would underpin. These would create new kinds of interdisciplinary works that tie photo reconstruction to extensive data-mining, and would blur boundaries between the arts, humanities, and sciences.

Originally published by the OpenEndedGroup in December 2011.

"Humanities in a Digital Age" Symposium Podcasts, including Jeremy Boggs, Alison Booth, Daniel J. Cohen, Mitchell S. Green, Ann Houston, and Stephen Ramsay

On November 11th, the University of Virginia’s Institute of the Humanities and Global Cultures hosted a daylong symposium on “The Humanities in a Digital Age.” The symposium included two panels—one on Access & Ownership and the other on Research & Teaching—and two keynote talks.

The first keynote was given by Stephen Ramsay, Associate Professor in the Department of English and Fellow in the Center for Digital Research in the Humanities at the University of Nebraska–Lincoln.

The second keynote was given by Dan Cohen, Associate Professor in the Department of History and Director of the Roy Rosenzweig Center for History and New Media at George Mason University.

Jeremy Boggs and Ann Houston, “Access and Ownership.”

Stephen Ramsay, “Textual Behavior in the Human Male.”

The talk can be downloaded as an MPEG4/H.264 or Ogg Theora file.

Alison Booth and Mitch Green, “Research and Teaching.”

Dan Cohen, “Humanities Scholars and the Web: Past, Present, and Future,” with response by Jerome McGann.

Originally published by the Scholars’ Lab on December 13, 2011. Keynote by Stephen Ramsay revised March 2012 and available for download (video, PDF).

Philosophical Leadership Needed for the Future: Digital Humanities Scholars in Museums, including Nik Honeysett and Michael Edson

Nik Honeysett, Head of Administration for the J. Paul Getty Museum.