Technology

The Simple Truth

The Plains to Peaks Collective uses Combine as its aggregation tool, bringing together the varied and customized metadata formats of cultural heritage institutions throughout Colorado and Wyoming. Combine was created by the Michigan Service Hub, and we thank them heartily for building it and for their willingness to support and share it with us and other DPLA Service Hubs. Thanks, Michigan Hub!

What is Combine?

Combine is an application to facilitate the harvesting, transformation, analysis, and publishing of metadata records by Service Hubs for inclusion in the Digital Public Library of America (DPLA).

The name “Combine”, pronounced /kämˌbīn/ with a long i, is a nod to the combine harvester, used in farming for “combining three separate harvesting operations – reaping, threshing, and winnowing – into a single process.” Instead of grains, we have metadata records! These records may arrive in a variety of metadata formats and in various states of transformation, and they may or may not be valid in the context of a particular data model. Like the combine equipment used for farming, this application is designed to provide a single point of interaction for the multiple steps of harvesting, transforming, and analyzing metadata in preparation for inclusion in DPLA.

The Process:

An illustration showing the Combine harvest and output process.
Partners provide us with records from a variety of sources and in a variety of formats: OAI endpoints, APIs, spreadsheets, JSON or XML files, and so on. As long as the records are, or can be converted into, valid XML, we can harvest them into Combine. The data schema is irrelevant: as long as it’s XML, Combine will harvest it.
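As an illustration of what "can be converted into valid XML" might look like for a spreadsheet, here is a minimal Python sketch that turns CSV rows into one XML record per row. It is not part of Combine; the `record` wrapper element, the function name `rows_to_xml`, and the sample columns are assumptions made for the example, and it assumes the column headers are already valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(csv_text):
    """Convert CSV text into a list of XML strings, one per spreadsheet row."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.Element("record")  # hypothetical wrapper element
        for column, value in row.items():
            # Skip empty cells and any overflow cells without a header.
            if column and value:
                field = ET.SubElement(record, column.strip().lower())
                field.text = value.strip()
        records.append(ET.tostring(record, encoding="unicode"))
    return records


# Made-up sample data standing in for a partner spreadsheet.
sample = (
    "title,creator,date\n"
    "Main Street Parade,Jane Doe,1912\n"
    "Ranch at Dusk,,1925\n"
)

for xml_record in rows_to_xml(sample):
    print(xml_record)
```

Each resulting string is a small, well-formed XML document, which is the only shape requirement described above.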

During the harvest process, Combine gives us tools to analyze the records to make sure they meet DPLA minimum requirements, and to help us map data to the appropriate fields for the DPLA.
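The kind of check involved can be sketched in a few lines of Python. This is not Combine's own analysis tooling, and the field list below is a placeholder rather than DPLA's actual minimum requirements; the function name `missing_fields` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder list for illustration only; NOT the real DPLA requirements.
REQUIRED_FIELDS = ("title", "rights", "identifier")


def missing_fields(xml_text, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or empty in a record."""
    root = ET.fromstring(xml_text)
    present = set()
    for element in root.iter():
        # Drop any "{namespace}" prefix so only the local name is compared.
        local_name = element.tag.rsplit("}", 1)[-1]
        if element.text and element.text.strip():
            present.add(local_name)
    return [name for name in required if name not in present]


record = "<record><title>Main Street Parade</title><rights> </rights></record>"
print(missing_fields(record))
```

A record with a whitespace-only `rights` element is reported as missing that field, which is the sort of gap worth catching before records are mapped.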

Once we have the records in Combine and have mapped the data, we transform the records via XSLT or Python into a standardized format. We chose MODS for our data, but Combine can output any XML schema you’d like.
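To give a rough sense of what a Python transformation into MODS involves, the sketch below maps two Dublin Core elements to their MODS equivalents using only the standard library. It is a standalone illustration under stated assumptions: it does not use Combine's transformation interface, the function name `dc_to_mods` is hypothetical, and a real mapping would cover far more fields.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)  # serialize with a readable prefix


def dc_to_mods(dc_xml):
    """Map dc:title and dc:creator into a minimal MODS record string."""
    source = ET.fromstring(dc_xml)
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    for title in source.iter(f"{{{DC_NS}}}title"):
        title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
        ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title.text
    for creator in source.iter(f"{{{DC_NS}}}creator"):
        name = ET.SubElement(mods, f"{{{MODS_NS}}}name")
        ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = creator.text
    return ET.tostring(mods, encoding="unicode")


# Made-up input record for the example.
dc_record = (
    f'<record xmlns:dc="{DC_NS}">'
    "<dc:title>Main Street Parade</dc:title>"
    "<dc:creator>Jane Doe</dc:creator>"
    "</record>"
)
print(dc_to_mods(dc_record))
```

The same mapping could equally be written as an XSLT stylesheet; Python is used here only to keep the example self-contained.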

We then publish the transformed records to Combine’s OAI endpoint, from which the DPLA harvests them for its ingests.
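For readers unfamiliar with OAI harvesting, the sketch below shows how a harvester might build OAI-PMH ListRecords requests and follow resumption tokens. It makes no network calls. The base URL, the `mods` metadata prefix, and both function names are assumptions for illustration, not details of Combine's or DPLA's actual setup.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix="mods", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token from a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{{{OAI_NS}}}resumptionToken")
    if token is not None and token.text:
        return token.text.strip()
    return None


# Placeholder endpoint and a trimmed, made-up response.
base = "https://example.org/oai"
sample_response = (
    f'<OAI-PMH xmlns="{OAI_NS}"><ListRecords>'
    "<resumptionToken>page2</resumptionToken>"
    "</ListRecords></OAI-PMH>"
)

print(list_records_url(base))
print(list_records_url(base, resumption_token=next_token(sample_response)))
```

A harvester repeats the second request until the response no longer carries a token, at which point the full set of published records has been retrieved.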

The Technical Stuff

Installation:

Combine has a fair number of server components, dependencies, and configurations that must be in place for it to work, as it leverages Apache Spark, among other applications, for processing on the back end.

To this end, a separate GitHub repository, Combine-playbook, has been created to assist with provisioning a server with everything needed to run Combine. This repository provides routes for server provisioning via Vagrant and/or Ansible. Please visit the Combine-playbook repository for more information about installation.

Where to Find out More:

If you just want to kick the tires, the QuickStart guide provides a walkthrough of harvesting, transforming, and publishing some records that lays the groundwork for more advanced analysis.