Text Mining Tools in the Humanities: An Analysis Framework
John Simpson, Geoffrey Rockwell, Ryan Chartier, Stéfan Sinclair, Susan Brown, Amy Dyrbye and Kirsten Uszkalo
Poster
Abstract[1]
The most extensive compendium of text mining tools to date includes 71 tools and summarizes each based on ten criteria.[2] While broad, this listing of tools and their properties is general in its review criteria and offers no testing-based observations to help users assess actual usability. Humanists who want to try text analysis, visualization, and mining tools in their research need information relevant to their needs and reviews that help them choose among tools. This poster presents the testing framework developed for the TAPoR 2.0 portal reviews. The poster will cover:
- The need for tool reviews
- The information gathered about tools
- The testing and reviewing process
- Conclusions about the state of text tools
The poster will be accompanied by a demonstration of TAPoR 2.0 so that users can see the reviews in context.
1. The Need for Tool Reviews
A humanities researcher new to computing methods who looks on the internet for peer reviews of text tools is going to be disappointed. There is nothing like The New York Review of Books for tools, though in the early days of humanities computing you could find short announcements about tools in journals like Computers and the Humanities. We believe, however, that certain text tools are intellectual contributions to the field and should be reviewed, not just to help people choose what tools to use, but also as a way of engaging these tools in a dialogue around computer-assisted interpretation.[3] While there are individual blog entries about tools scattered across the web, each is written from the perspective of a single user working with a different dataset, which makes comparison difficult. If we want to make computing methods accessible and encourage colleagues to use tools, we need a more systematic approach. This is especially true of text mining tools, which can’t simply be tried with a text at hand.
2. Information Gathered about Tools
TAPoR 2.0 is a portal for discovering and reviewing text analysis, visualization, and mining tools. It is a complete redevelopment of the original TAPoR portal that focuses on discovery and review rather than only on providing access to web services.[4] As part of the redevelopment we used a persona/scenario usability design approach to identify the attributes by which users might want to discover tools.[5] Further, we built TAPoR 2.0 so that editors can add new attributes without the database having to be reprogrammed. The attributes we currently record for tools include the author(s), ease of use, type of analysis, and type of license. We also provide links to related tools. Our poster will be accompanied by a demonstration of TAPoR 2.0 so that visitors can explore what we have and how we represent it.
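One way to let editors add attributes without changing the database structure is to store attribute definitions as data rather than as fixed columns. The following is a minimal sketch of that idea only; it is not the TAPoR 2.0 implementation, and the class names (`AttributeType`, `Tool`, `Catalogue`), the tool names, and the attribute values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AttributeType:
    """An editor-defined attribute such as "Type of license" (hypothetical model)."""
    name: str
    # An empty set means any free-text value is accepted.
    allowed_values: set[str] = field(default_factory=set)


@dataclass
class Tool:
    """A catalogued tool; its attribute values are data, not fixed fields."""
    name: str
    attributes: dict[str, list[str]] = field(default_factory=dict)


class Catalogue:
    """In-memory stand-in for a tool database with extensible attributes."""

    def __init__(self) -> None:
        self.attribute_types: dict[str, AttributeType] = {}
        self.tools: dict[str, Tool] = {}

    def define_attribute(self, name: str, allowed_values=()) -> None:
        # Adding an attribute is a data change, not a schema change.
        self.attribute_types[name] = AttributeType(name, set(allowed_values))

    def add_tool(self, name: str) -> Tool:
        tool = Tool(name)
        self.tools[name] = tool
        return tool

    def set_value(self, tool_name: str, attribute: str, value: str) -> None:
        attr = self.attribute_types[attribute]  # KeyError if never defined
        if attr.allowed_values and value not in attr.allowed_values:
            raise ValueError(f"{value!r} is not allowed for {attribute!r}")
        self.tools[tool_name].attributes.setdefault(attribute, []).append(value)

    def find(self, attribute: str, value: str) -> list[str]:
        """Return the names of tools that carry the given attribute value."""
        return sorted(
            tool.name
            for tool in self.tools.values()
            if value in tool.attributes.get(attribute, [])
        )


if __name__ == "__main__":
    catalogue = Catalogue()
    # Illustrative values only; these are not TAPoR's actual vocabularies.
    catalogue.define_attribute("Type of license", ["Open source", "Closed source"])
    catalogue.define_attribute("Ease of use")  # free text
    catalogue.add_tool("Tool A")
    catalogue.set_value("Tool A", "Type of license", "Open source")
    catalogue.set_value("Tool A", "Ease of use", "Moderate")
    print(catalogue.find("Type of license", "Open source"))
```

In this sketch a new attribute is one call to `define_attribute`, and discovery by attribute is a simple filter. A production system would presumably keep the same two kinds of record in database tables.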
3. The Testing and Reviewing Process
Recording basic information about tools is not enough, especially for sophisticated text mining tools like MALLET that take time to learn and can be used in different ways. For text mining tools, users need longer narrative reviews. For this reason we developed processes for testing and reviewing tools. For simpler text analysis and visualization tools, this involved developing a set of different texts with which to test tools so that we could compare their use. For text mining we had to go further, and we are working with the CWRC project (Canadian Writing Research Collaboratory) to develop a number of literary corpora, together with experts we can draw on to help assess the value of results. As of this writing, we have three corpora drawn from the Orlando Project and one of Victorian children’s literature. We expect to have two more by the time of presentation. The poster will discuss the criteria used to develop these open test corpora.
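The idea of holding the test texts fixed while varying the tool can be sketched in a few lines. This is an illustration under stated assumptions, not the project's testing procedure, which relies on human reviewers: the two "tools" here (`word_frequencies`, `type_token_ratio`) are toy stand-in functions, and `run_tools` is a hypothetical helper name.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Toy stand-in for a word-frequency tool."""
    return Counter(tokenize(text))


def type_token_ratio(text):
    """Toy stand-in for a vocabulary-richness tool."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def run_tools(tools, corpus):
    """Apply every tool to every test text.

    Returns a nested dict: {tool_name: {text_id: result}}.
    Because each tool sees the same texts, the results are directly comparable.
    """
    return {
        tool_name: {text_id: tool(text) for text_id, text in corpus.items()}
        for tool_name, tool in tools.items()
    }


if __name__ == "__main__":
    # Placeholder texts; a real test set would use the shared test corpora.
    corpus = {
        "sample-1": "The cat sat on the mat.",
        "sample-2": "A short second sample text.",
    }
    tools = {"frequencies": word_frequencies, "ttr": type_token_ratio}
    for tool_name, results in run_tools(tools, corpus).items():
        print(tool_name, results)
```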
The reviews take the form of comments pinned to the top of the list of available comments. This allows others to leave comments as well, though we haven’t seen much activity from people not connected to the project (with the exception of spammers, who seem to feel there is a connection between text analysis tools and various stimulants). We have developed guidelines for reviews to make them accessible and comparable. The poster will outline our guidelines.
4. Conclusions from Testing and Reviewing
Having tested and reviewed a variety of tools and text mining systems, we see some common barriers to access. Most of these tools were developed for use by their own developers and are poorly documented for people not involved in the development. Further, many tools, including those we are involved in, are in continuous development, so their documentation is often out of date. We will therefore end this poster with lessons learned while testing and reviewing text mining tools, with particular attention to removing usability barriers for novice users.
Originally presented by John Simpson, Geoffrey Rockwell, Stéfan Sinclair, Kirsten Uszkalo, Susan Brown, Amy Dyrbye, and Ryan Chartier at DH2013 on July 17, 2013.
- [1] The authors would like to thank and acknowledge support from both the INKE Research Group and the Text Mining & Visualization Project, funded by the Social Sciences and Humanities Research Council of Canada.
- [2] van Gemert, Jan. “Text Mining Tools on the Internet.” ISIS Technical Report Series, 2000. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4312886.
- [3] Ramsay, S. and G. Rockwell. “Developing Things: Notes toward an Epistemology of Building in the Digital Humanities.” In Debates in the Digital Humanities, edited by M. K. Gold, 75-84. Minneapolis, Minnesota: University of Minnesota Press, 2012.
- [4] Rockwell, Geoffrey. “TAPoR: Building a Portal for Text Analysis.” In Mind Technologies: Humanities Computing and the Canadian Academic Community, edited by Raymond Siemens and David Moorman, 285-299. Calgary: University of Calgary Press, 2006.
- [5] Cooper, Alan. The Inmates Are Running the Asylum. Indianapolis, Indiana: SAMS, 2004.