Improving Large Digital Collection Analysis through Information Visualization


Published: November 12, 2013 by Dr. Michelle Chen

I am working on an interesting project that I’d like to share with you and, if possible, get some feedback and comments on. The project is about improving large digital collection analysis through a new information visualization model for better access and retrieval.

Digital collection management is the process of managing and organizing digital information collected from various sources. With advances in storage and archive technologies, the growth of communication and collaboration platforms, and the growing interest in sharing data in response to the Obama Big Data Initiative, the volume and variety of available digital documents have increased tremendously. For example, Harvard University has made public the information on more than 12 million digital objects held in its 73 libraries, each object described by more than 100 attributes (Hardy, 2012). The sheer scale of digital collections, together with the fact that their contents usually come from collaborative efforts, poses new challenges and research opportunities for the library and information science community. For example, from an academic perspective, the growing number of publications, combined with increasingly cross-disciplinary sources, has made it challenging for scholars to follow emerging research fronts and identify key publications (Dunne et al., 2012). Among the many collection management issues, how to create a model that yields better retrieval and archival results and a streamlined user experience for large-scale digital collections is an important topic and serves as the main motivation of my research.

My proposed study will focus on developing a new information visualization model that gives users (both general users and librarians) a platform to better archive and retrieve large-scale digital documents. Information visualization is the creation of 2-dimensional or 3-dimensional representations of data that enable new insights and knowledge to be discovered. Because it draws on the power of human vision to process large amounts of information quickly and in parallel, information visualization is well suited to dealing with large digital collections. In my research, the information visualization model will extract the hidden topics from digital documents at a semantic level and then visualize those topics to reveal document relations and similarities. The model will be tested with the Illinois Digital Archives to provide practical evidence of its effectiveness and will be qualitatively evaluated with a user survey for experience feedback.
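To make the idea of "document relations and similarities" a little more concrete, here is a minimal sketch in Python. It is not the proposed model: it compares documents by plain word counts using cosine similarity, which is a syntactic stand-in, whereas the proposed model would compare documents by their hidden topics. The sample records and the function names (`term_vector`, `cosine_similarity`, `similarity_matrix`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Pairwise similarities; entry [i][j] compares docs[i] with docs[j]."""
    vectors = [term_vector(d) for d in docs]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical sample records, not drawn from any real collection.
docs = [
    "civil war letters from illinois soldiers",
    "letters and diaries of illinois soldiers",
    "railroad maps of the midwest",
]
matrix = similarity_matrix(docs)

for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

A matrix like this is the kind of input a visualization could draw as a heat map or a node-link diagram. In a semantic version, the word counts would be replaced by per-document topic weights, so that two documents could score as similar even when they share few exact words.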

This project is expected to fill the research gap between information visualization at the syntactic level and at the semantic level. With the new model, users will be able to retrieve and access digital documents not only through explicit information such as author, title, and subject, but also through more tacit knowledge drawn from hidden topics. This study will improve the user experience of interacting with digital collections and open up a research direction that focuses on semantic visualization. Ultimately, this proposed study will be one of the outputs of the SJSU Cybersecurity / Big Data Cluster.


Dunne, C., Shneiderman, B., Gove, R., Klavans, J., & Dorr, B. (2012). Rapid understanding of scientific paper collections: Integrating statistics, text analytics, and visualization. Journal of the American Society for Information Science and Technology, 63(12), 2351-2369.

Hardy, Q. (2012, April 24). Harvard releases big data for books. The New York Times. Retrieved from

