Collections as Data

The UC San Diego Library’s Digital Collections website is presently home to more than 120,000 research datasets, images, documents, and video and audio recordings. While the datasets make scientific data openly available and discoverable by researchers, the Library also works with primary source materials, such as prints, manuscripts, audio and visual recordings, and a wealth of other physical materials that are not typically considered data in their physical form.

Recently, researchers and scholars have begun to view the information related to physical collections as a source of research data. One tool being developed by the San Diego Supercomputer Center, called SuAVE, allows end users to organize items according to a wide variety of data points of their own choosing.

Currently, the Digital Collections website includes standard searching and sorting capabilities. SuAVE, however, provides a more visual presentation, displaying entire collections on a single page. Viewers can select topics, dates, or geographic regions to narrow their search. They can also select more than one topic (e.g., Festivals and Dance) and have the items displayed together.

SuAVE users can also choose from three types of display for their search: a standard grid view, a column view, and a cross-tab view, which allows items to be sorted across two values. The data from the items, combined with the interests of the user, dictate how the contents are displayed.
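
To make the idea of a cross-tab view concrete, here is a minimal Python sketch of sorting items across two values. It is not SuAVE's code or API. The item records, the field names (topic, decade), and the cross_tab function are all hypothetical and exist only to illustrate the grouping.

```python
from collections import defaultdict

# Hypothetical item records; titles, fields, and values are invented for illustration.
items = [
    {"title": "Item A", "topic": "Festivals", "decade": "1960s"},
    {"title": "Item B", "topic": "Dance", "decade": "1960s"},
    {"title": "Item C", "topic": "Festivals", "decade": "1970s"},
    {"title": "Item D", "topic": "Festivals", "decade": "1960s"},
]


def cross_tab(records, row_field, col_field):
    """Group record titles into cells keyed by (row value, column value)."""
    cells = defaultdict(list)
    for record in records:
        cells[(record[row_field], record[col_field])].append(record["title"])
    return dict(cells)


table = cross_tab(items, "topic", "decade")

# Print each non-empty cell of the cross-tab.
for (topic, decade), titles in sorted(table.items()):
    print(f"{topic} / {decade}: {titles}")
```

Each cell holds the items that share one value from each field, so changing the two field names passed to cross_tab rearranges the same collection along a different pair of values.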

While SuAVE helps find patterns and organize the data about visual resources, text-based collections are also open to data mining. As an example, former Digital Scholarship Librarian Erin Glass used the software to analyze 3,100 issues of the UCSD Guardian published over a period of 50 years, all of which are available in our Digital Collections. Glass explored word frequencies and topics in the newspapers and found that the word “he” appears significantly more often than the word “she” in the collection, suggesting that men were more frequently discussed in or interviewed for the paper than women. Data mining this digitized collection of newspapers revealed a pattern that would have been far more difficult and time-consuming to detect without data analysis.
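
A word-frequency comparison of this kind can be sketched in a few lines of Python. This is not the method or software Glass used. The pronoun_counts function and the sample sentence are hypothetical, and the tokenization is deliberately simple.

```python
import re
from collections import Counter


def pronoun_counts(text, words=("he", "she")):
    """Count whole-word, case-insensitive occurrences of the given words."""
    # Split into lowercase word tokens so "the" is never counted as "he".
    # Contractions such as "he's" stay as single tokens and are not counted.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word] for word in words}


# Invented sample text, standing in for the text of a digitized newspaper issue.
sample = "He said the team won. She agreed, and he thanked the coach."
print(pronoun_counts(sample))
```

Running the same count over the text of every issue and summing the results would give collection-wide totals for each word.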

In short, tools like SuAVE give researchers the ability to detect patterns more easily in both visual and text-based collections, providing enhanced ways to explore and engage with digital collections.