Text Mining Tools in the Humanities: An Analysis Framework

Poster

Abstract[1]

The most extensive compendium of text mining tools to date includes 71 tools and summarizes each based on ten criteria.[2] While extensive, this listing of tools and their properties is general in its review criteria and does not offer any testing-based observations to help users assess actual usability. Humanists looking to try text analysis, visualization, and mining tools for research need better information that is relevant to their needs and reviews of tools that help them make choices. This poster presents the testing framework developed for the TAPoR 2.0 portal reviews. The poster will cover:

  1. The need for tool reviews
  2. The information gathered about tools
  3. The testing and reviewing process
  4. Conclusions about the state of text tools

The poster will be accompanied by a demonstration of TAPoR 2.0 so that users can see the reviews in context.

1. The Need for Tool Reviews

A humanities researcher new to computing methods who looks online for peer reviews of text tools is going to be disappointed. There is nothing like The New York Review of Books for tools, though in the early days of humanities computing you could find short announcements about tools in journals like Computing in the Humanities. We believe, however, that certain text tools are intellectual contributions to the field and should be reviewed, not just to help people choose which tools to use, but also as a way of engaging these tools in a dialogue around computer-assisted interpretation.[3] While individual blog entries about tools are scattered across the web, each is written from the perspective of a single user working with a different dataset, which makes comparison difficult. If we want to make computing methods accessible and encourage colleagues to use tools, we need a more systematic approach. This is especially true of text mining tools, which can’t simply be tried with a text at hand.

2. Information Gathered about Tools

TAPoR 2.0 is a portal for discovering and reviewing text analysis, visualization, and mining tools. It is a complete redevelopment of the original TAPoR portal, refocused on discovery and review rather than on providing access only to web services.[4] As part of the redevelopment we used a persona/scenario usability design approach to identify attributes by which users might want to discover tools.[5] Further, we built TAPoR 2.0 so that editors can add new attributes without the database having to be reprogrammed. The attributes we currently record for tools include the author(s), ease of use, type of analysis, and type of license. We also provide links to related tools. Our poster will be accompanied by a demonstration of TAPoR 2.0 so that visitors can explore what we have and how we represent it.

Figure 1: TAPoR 2.0 Home Screen
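
One way to picture how attributes can be added without reprogramming the database is to store attribute names as data rather than as fixed fields. The sketch below is only an illustration under that assumption and is not a description of how TAPoR 2.0 is actually built. The class names, the tools "Tool A" and "Tool B", the attribute values, and the "review status" attribute are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Tool:
    """A tool record whose descriptive attributes are stored as data."""
    name: str
    # Maps an attribute name to one or more values, e.g. "type of license".
    attributes: dict[str, list[str]] = field(default_factory=dict)


class ToolCatalogue:
    """Hypothetical catalogue where editors define attributes at run time."""

    def __init__(self) -> None:
        self.attribute_types: set[str] = set()
        self.tools: dict[str, Tool] = {}

    def define_attribute(self, attribute: str) -> None:
        # Adding an attribute is a data change, not a schema change.
        self.attribute_types.add(attribute)

    def add_tool(self, name: str) -> Tool:
        tool = Tool(name)
        self.tools[name] = tool
        return tool

    def set_value(self, tool_name: str, attribute: str, *values: str) -> None:
        # Only attributes that an editor has defined may be recorded.
        if attribute not in self.attribute_types:
            raise KeyError(f"undefined attribute: {attribute}")
        self.tools[tool_name].attributes[attribute] = list(values)

    def find(self, attribute: str, value: str) -> list[str]:
        # Return the names of tools that carry the given attribute value.
        return sorted(
            name
            for name, tool in self.tools.items()
            if value in tool.attributes.get(attribute, [])
        )


def build_example_catalogue() -> ToolCatalogue:
    """Build a small catalogue with made-up tools and values."""
    catalogue = ToolCatalogue()
    for attribute in ("ease of use", "type of analysis", "type of license"):
        catalogue.define_attribute(attribute)

    catalogue.add_tool("Tool A")
    catalogue.set_value("Tool A", "type of analysis", "concordance", "visualization")
    catalogue.set_value("Tool A", "type of license", "open source")

    catalogue.add_tool("Tool B")
    catalogue.set_value("Tool B", "type of analysis", "topic modelling")
    catalogue.set_value("Tool B", "type of license", "open source")

    # Later, an editor introduces a new attribute without touching the classes.
    catalogue.define_attribute("review status")
    catalogue.set_value("Tool B", "review status", "reviewed")
    return catalogue


if __name__ == "__main__":
    example = build_example_catalogue()
    print(example.find("type of license", "open source"))
    print(example.find("review status", "reviewed"))
```

In this sketch the "review status" attribute is introduced after the tools already exist, and discovery by that attribute works through the same `find` call as for the original attributes.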

3. The Testing and Reviewing Process

Recording basic information about tools is not enough, especially for sophisticated text mining tools like MALLET that take time to learn and can be used in different ways. For text mining tools, users need longer narrative reviews. For this reason we developed processes for testing and reviewing tools. For simpler text analysis and visualization tools, this involved developing a set of different texts with which to test the tools so that we could compare their use. For text mining we had to go further: we are working with the CWRC project (Canadian Writing Research Collaboratory) to develop a number of literary corpora with experts we can draw on to help assess the value of results. As of this writing, we have three corpora drawn from the Orlando Project and one of Victorian children’s literature. We expect to have two more by the time of presentation. The poster will discuss the criteria used to develop these open test corpora.

The reviews take the form of comments that have been pinned to the top of the list of comments available. This allows others to leave comments, though we haven’t seen much activity by people not connected to the project (with the exception of spammers who seem to feel there is a connection between text analysis tools and various stimulants). We have developed guidelines for reviews so as to make them accessible and comparable. The poster will outline our guidelines.

4. Conclusions from Testing and Reviewing

Having tested and reviewed a variety of tools and text mining systems, we see some common barriers to access. Most of these tools have been developed for use by their own developers and are poorly documented for people not involved in their development. Further, many tools, including those we are involved in, are in continuous development, which leaves their documentation out of date. We will therefore end this poster with lessons learned while testing and reviewing text mining tools, with particular attention to removing usability barriers for novice users.

Originally presented by John Simpson, Geoffrey Rockwell, Stéfan Sinclair, Kirsten Uszkalo, Susan Brown, Amy Dyrbye, and Ryan Chartier at DH2013 on July 17, 2013.

  [1] The authors would like to thank and acknowledge support from both the INKE Research Group and the Text Mining & Visualization Project, funded by the Social Sciences and Humanities Research Council of Canada.
  [2] van Gemert, Jan. “Text Mining Tools on the Internet.” ISIS Technical Report Series, 2000. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4312886.
  [3] Ramsay, S. and G. Rockwell. “Developing Things: Notes toward an Epistemology of Building in the Digital Humanities.” In Debates in the Digital Humanities, edited by M. K. Gold, 75-84. Minneapolis, Minnesota: University of Minnesota Press, 2012.
  [4] Rockwell, Geoffrey. “TAPoR: Building a Portal for Text Analysis.” In Mind Technologies: Humanities Computing and the Canadian Academic Community, edited by Raymond Siemens and David Moorman, 285-299. Calgary: University of Calgary Press, 2006.
  [5] Cooper, Alan. The Inmates Are Running the Asylum. Indianapolis, Indiana: SAMS, 2004.

About John Simpson, Geoffrey Rockwell, Ryan Chartier, Stéfan Sinclair, Susan Brown, Amy Dyrbye, and Kirsten Uszkalo

John Simpson is a postdoctoral fellow at the University of Alberta, splitting time between two projects: INKE and Text Mining & Visualization for Digital Literary History. He teaches, codes, and conducts research in the digital humanities on topics related to visualization, text mining, the semantic web, programming, gaming, the philosophy of science, and the philosophy of computing.

Dr. Geoffrey Martin Rockwell is a Professor of Philosophy and Humanities Computing at the University of Alberta, Canada. He received a B.A. in philosophy from Haverford College and an M.A. and Ph.D. in philosophy from the University of Toronto, where he also worked as a Senior Instructional Technology Specialist. From 1994 to 2008 he was at McMaster University, where he was the Director of the Humanities Media and Computing Centre (1994-2004) and led the development of an undergraduate Multimedia program funded through the Ontario Access To Opportunities Program. He has published and presented papers in the areas of philosophical dialogue, textual visualization and analysis, humanities computing, instructional technology, computer games, and multimedia. He is the project leader for the CFI (Canada Foundation for Innovation) funded project TAPoR, a Text Analysis Portal for Research, which has developed a text tool portal for researchers who work with electronic texts, and he organized an SSHRC-funded conference, The Face of Text, in 2004. He has published a book, Defining Dialogue: From Socrates to the Internet, with Humanity Books.

Stéfan Sinclair is an Associate Professor of Digital Humanities at McGill University. His primary area of research is in the design, development, usage and theorization of tools for the digital humanities, especially for text analysis and visualization.

Susan Brown is Professor of English at the University of Guelph and Visiting Professor at the University of Alberta. Her research interests include the digital humanities, Victorian literature, and women’s writing. All of these interests inform Orlando: Women’s Writing in the British Isles from the Beginnings to the Present, an ongoing experiment in digital literary history published online by Cambridge University Press since 2006 that she directs and co-edits. She leads the Canadian Writing Research Collaboratory infrastructure project.
