Architecture to Enable Large-Scale Computational Analysis of Millions of Volumes

Poster

Abstract

The HathiTrust Research Center (HTRC) is a collaborative research center that provides Digital Humanities researchers access to not only millions of volumes from the HathiTrust (HT) digital library, but also cutting-edge software tools and cyber infrastructure to perform advanced computational analysis over the corpus at an unprecedented scale.

The corpus at the HTRC currently consists of over 3 million public domain volumes, and the HTRC anticipates access to an additional 6 million in-copyright volumes. In their raw form at the HathiTrust, these volumes are stored as files on special hardware using an internal Pairtree structure. This structure is optimal for the HathiTrust's primary function of delivering digital page images to library patrons for viewing; however, it does not lend itself to large-scale computational analysis, which is the primary function of the HTRC. Navigating the Pairtree and uncompressing the text data would run into major performance and scalability issues. While researchers from other scientific communities have been addressing aspects of the “Big Data” problem with success, the corpus that the HTRC hosts presents a unique setting: it consists of a massive number of small text-based files, whereas most solutions from the scientific communities are tailored towards large files and non-text-based content.

In this poster, we will present the approach the HTRC takes to solve this problem. The HTRC keeps the Pairtree only for synchronization with the HT; during the ingest process, it processes the volume data from the local Pairtree and pushes it to a NoSQL storage cluster running Apache Cassandra on conventional hardware. In order to balance the data store and ingest workload, the developers at the HTRC and the HT also devised a very simple yet effective way to parallelize the rsync of the single source Pairtree at the HT across all Cassandra nodes: each node starts rsync at lower branches instead of at the root.
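The two ideas in this paragraph, the Pairtree layout and the branch-level rsync, can be illustrated with a minimal Python sketch. It is not the HTRC's ingest code. The path mapping follows the public Pairtree specification; the round-robin assignment of branches to nodes, the function names, the branch names, the node names, and the source and destination paths are all assumptions made for illustration.

```python
from itertools import cycle

# Characters the Pairtree specification hex-encodes as ^hh before path mapping.
_RESERVED = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
_SUBSTITUTIONS = str.maketrans("/:.", "=+,")


def pairtree_path(identifier):
    """Map an identifier to a Pairtree path of two-character segments."""
    encoded = []
    for ch in identifier:
        if ch in _RESERVED or not (0x21 <= ord(ch) <= 0x7E):
            # Reserved or non-visible characters become ^hh, one per UTF-8 byte.
            encoded.extend(f"^{byte:02x}" for byte in ch.encode("utf-8"))
        else:
            encoded.append(ch)
    cleaned = "".join(encoded).translate(_SUBSTITUTIONS)
    # Split the cleaned identifier into segments of at most two characters.
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2))


def assign_branches(branches, nodes):
    """Spread Pairtree branches over nodes round-robin (an assumed policy)."""
    plan = {node: [] for node in nodes}
    for branch, node in zip(branches, cycle(nodes)):
        plan[node].append(branch)
    return plan


def rsync_command(source_root, dest_root, branch):
    """Build an rsync invocation rooted at one branch instead of the tree root.

    Assumes the parent directories of the destination branch already exist.
    """
    return ["rsync", "-a", f"{source_root}/{branch}/", f"{dest_root}/{branch}/"]


if __name__ == "__main__":
    # Hypothetical branch, node, and path names, for illustration only.
    branches = ["39/01", "39/02", "30/00", "30/11", "32/10"]
    nodes = ["node1", "node2", "node3"]
    for node, assigned in assign_branches(branches, nodes).items():
        for branch in assigned:
            command = rsync_command("source.example.org:/pairtree_root",
                                    "/data/pairtree_root", branch)
            print(node, " ".join(command))
```

Because each node synchronizes only its own branches, the transfers can run concurrently and no node has to walk the whole tree. A real deployment would also have to decide how deep to cut the branches so that the work is reasonably even across nodes.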

The use of a NoSQL cluster makes the architecture more complex than one built on a traditional file system, but such complexity is transparent to Digital Humanities researchers, as most of the HTRC components with which user algorithms interact are RESTful web services, such as the Data API for accessing the data. The HTRC uses Blacklight, an open source bibliographic search and display interface, backed by a Solr index, to let users search for volumes for analysis and create collections. To apply analytical techniques to the data, a user may choose from a number of algorithms provided through the web portal, including SEASR/Meandre flows. In addition, the HTRC is actively researching and developing a secure computation environment (dubbed the Sloan Cloud) to support large-scale non-consumptive research over copyrighted volumes, and an experimental release is scheduled for the end of March. The Sloan Cloud will allow researchers to deploy their own analysis algorithms against a corpus like the HT data, to save intermediate data for later reuse, and to include custom worksets in the computation. We will present our early findings from the experimental Sloan Cloud and hope to get feedback from the digital humanities research community.
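As a rough illustration of how a client might search a Solr index for volumes before requesting them from a data service, the following Python sketch builds a Solr select URL and pulls identifiers out of a JSON response. The `q`, `rows`, and `wt` parameters and the `response.docs` layout are standard Solr. The base URL, the core name `volumes`, the `title` and `id` fields, and both function names are assumptions, not the HTRC's actual endpoints or schema.

```python
import json
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def volume_ids(solr_json_text, id_field="id"):
    """Extract one field per matching document from a Solr JSON response.

    The field name is schema-specific; "id" is an assumed default.
    """
    docs = json.loads(solr_json_text)["response"]["docs"]
    return [doc[id_field] for doc in docs]


if __name__ == "__main__":
    # Hypothetical local Solr core; no network request is made here.
    print(solr_select_url("http://localhost:8983/solr/volumes", "title:whale", rows=5))
    # A hand-written response in Solr's JSON shape, with made-up identifiers.
    sample = '{"response": {"numFound": 2, "docs": [{"id": "vol.1"}, {"id": "vol.2"}]}}'
    print(volume_ids(sample))
```

The resulting list of identifiers is the kind of input a researcher would then pass to a RESTful data service to fetch the volume text.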

Originally presented by Stacy Kowalczyk, Yiming Sun, Beth Plale, J. Stephen Downie, Loretta Auvil, Boris Capitanu, Kirk Hess, Zong Peng, Guangchen Ruan, Aaron Todd, and Jiaan Zeng at DH2013 on July 17, 2013.

About Yiming Sun, Stacy Kowalczyk, Beth Plale, J. Stephen Downie, Loretta Auvil, Boris Capitanu, Kirk Hess, Zong Peng, Guangchen Ruan, Aaron Todd, and Jiaan Zeng

Yiming Sun is a Senior Software Architect of the HathiTrust Research Center project and works as a full-time staff member at the Data to Insight Center of the Indiana University Pervasive Technology Institute. He has an M.S. degree in Computer Science from the School of Informatics and Computing at Indiana University and is currently a PhD candidate in the same school. His research interest is in the long-term preservation of scientific datasets.

Stacy T. Kowalczyk is an Assistant Professor in the Graduate School of Library and Information Science at Dominican University. Her research focuses on the problems of research data, big data, and curation, specifically looking at the intersection of social and technical issues. In her current work, she is investigating the research practices of scholars, the lifecycle of research data including data reuse, and the antecedents, barriers, and threats to preservation of research data.

Beth A. Plale is a Professor of Computer Science as well as the Director of the Data to Insight Center of the Pervasive Technology Institute and the Director of the Center for Data and Search Informatics at Indiana University. Plale is the Indiana Co-Director of the HathiTrust Research Center (HTRC). She has broad research and governance interests in the long-term preservation of and access to scientific data, and in enabling computational access to large-scale data for broader groups of researchers.

J. Stephen Downie is Associate Dean for Research and a Professor at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign. Downie is the Illinois Co-Director of the HathiTrust Research Center (HTRC). He is also Director of the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) and founder and ongoing director of the Music Information Retrieval Evaluation eXchange (MIREX).

Loretta Auvil works at the Illinois Informatics Institute (I3) at the University of Illinois at Urbana-Champaign. She received an MS in Computer Science from Virginia Tech and a BS in Applied Mathematics and Computer Science from Alderson-Broaddus College. She has worked with a diverse set of application drivers to integrate machine learning and information visualization techniques to meet the needs of research partners.

Boris Capitanu is a Research Programmer at the Illinois Informatics Institute of the University of Illinois at Urbana-Champaign. His research interests include large-scale data analysis, machine learning, data mining, and educational technologies. Boris holds a B.S. and an M.S. in Computer Science from the University of Illinois at Urbana-Champaign, and an MBA+MHRIR from the same institution.

Kirk Hess is the Digital Humanities Specialist at the University Library at the University of Illinois at Urbana-Champaign. He holds an MS in Library and Information Science from the University of Illinois at Urbana-Champaign and a BA in History from Carleton College. His current research with the HTRC focuses on enhancing discovery of the Solr index using a custom implementation of the Blacklight digital library platform.

Zong Peng is a Ph.D. candidate in Computer Science in the School of Informatics and Computing at Indiana University Bloomington. He is a Research Assistant in the Data to Insight Center of the Indiana University Pervasive Technology Institute, under the supervision of Professor Plale. His current research with the HTRC focuses mainly on building, deploying, and maintaining the index infrastructure for the entire HTRC collection, providing security, auditing, and customized functions.

Guangchen Ruan is currently working toward a Ph.D. degree in Computer Science in the School of Informatics and Computing at Indiana University Bloomington. He is a Research Assistant in the Data to Insight Center of the Indiana University Pervasive Technology Institute, under the supervision of Professor Plale. His research interests include big data analysis and management, distributed data-intensive computing, data visualization, and data mining.

Aaron Todd is a Ph.D. candidate in the School of Informatics and Computing at Indiana University Bloomington. He is a Research Assistant in the Data to Insight Center of the Indiana University Pervasive Technology Institute, under the supervision of Professor Plale. His research interests include concurrency, parallelism, compilers, parallel runtime systems, logic programming, type systems, and effect systems.

Jiaan Zeng is a Ph.D. candidate in the School of Informatics and Computing at Indiana University Bloomington. He is a Research Assistant in the Data to Insight Center of the Indiana University Pervasive Technology Institute, under the supervision of Professor Plale. His research focuses on large-scale distributed systems and data management, specifically on how to run data analysis on large data sets.