Indexing EAD in ArcLight

Jessie Keck edited this page Sep 17, 2019 · 26 revisions

Now that you have your ArcLight application up and running, we need to index data into it.

EAD requirements

Currently, ArcLight's indexer expects the following:

  • Valid and well-formed EAD 2002 according to its XSD schema. If we can't parse the finding aid, we can't index it. (Indexing DTD-compliant EAD 2002 might work, but we can't guarantee it.)
  • All components have at least a <unittitle/> or <unitdate/>. Without either, we won't be able to display anything!
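
The second requirement can be checked before indexing. The following is a minimal sketch, not part of ArcLight: it assumes components are <c> or <c01> through <c12> elements, that titles and dates live in each component's <did>, and that REXML (bundled with Ruby) is available. The helper name components_missing_labels is hypothetical.

```ruby
require 'rexml/document'

# Matches EAD 2002 component element names: <c> and <c01> through <c12>.
COMPONENT_NAME = /\Ac(0[1-9]|1[0-2])?\z/

# Returns the ids (or XPaths, when no id is present) of components whose
# <did> contains neither a <unittitle> nor a <unitdate>.
def components_missing_labels(xml)
  doc = REXML::Document.new(xml)
  missing = []
  doc.root.each_recursive do |element|
    next unless element.name.match?(COMPONENT_NAME)

    did = element.elements.to_a.find { |child| child.name == 'did' }
    labels = did ? did.elements.to_a.map(&:name) : []
    next if labels.include?('unittitle') || labels.include?('unitdate')

    missing << (element.attributes['id'] || element.xpath)
  end
  missing
end
```

Calling components_missing_labels(File.read('./eads/your-file.xml')) returns an empty array when every component has a title or a date.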

EAD recommendations

  • Components should all have unique IDs applied to them. These IDs are used as "slugs" for the identifiers of the documents in ArcLight. Maintaining these identifiers allows an EAD to be updated and re-indexed while preserving the URL at which each component resides (retaining any user bookmarks, etc.). We will mint IDs for components that do not have them, but these are derived from the location of the component within the hierarchy of the EAD. This means that if components are moved around, the metadata that resides at a given URL may change in unexpected ways. See Customizing behavior of indexing components w/o IDs below for more info.

Download sample EAD

First we need to download or access our EADs. Let's create a directory where we can store these within our application.

mkdir eads

Now let's add some data there.

# This command will save one of our test datasets to the directory you just created
wget -P eads/ https://raw.githubusercontent.com/sul-dlss/arclight/master/spec/fixtures/ead/nlm/alphaomegaalpha.xml

Repository configuration

Next we need to run our indexing task and tell it which "Repository" the EAD file is linked to. By default, your ArcLight application should have a generated config/repositories.yml file, which contains information about the repositories for your instance. For example, we want to link the EAD alphaomegaalpha.xml to the first repository in that file, nlm:

nlm:
  name: 'National Library of Medicine. History of Medicine Division'
  description: 'NLM’s History of Medicine Division collects, preserves, makes available, and interprets for diverse audiences one of the world’s richest collections of historical material related to human health and disease.'
  building: 'Building 38, Room 1E-21'
  address1: '8600 Rockville Pike'
  address2: ''
  city: 'Bethesda'
  state: 'MD'
  zip: '20894'
  country: 'USA'
  phone: ''
  contact_info: '[email protected]'
  thumbnail_url: "https://collections.nlm.nih.gov/pageturnerserver/ajaxp?theurl=http://localhost:8080/fedora/get/nlm:nlmuid-101421040-img/THUMB"
  google_request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
  google_request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"

We recommend that your config/repositories.yml contain only the repositories for which you have EADs to index.

Configuring a repository for Google Form Requests

ArcLight repositories can be configured to make items requestable through Google Forms. To enable this functionality, provide the following keys for your repository in config/repositories.yml, under the google_form entry of the request_types key:

  • request_url - the URL of the user-facing version of your request form
  • request_mappings - this string represents an encoded form field mapping between your custom form fields and ArcLight. The configurable ArcLight fields are:
    • collection_name
    • collection_creator
    • eadid
    • containers

To get the Google Form field identifiers, use the "pre-filled" form to get a crafted url with a similar format to the request_mappings format. See Google Forms support for more information.

An example of a correctly configured form looks like this:

  request_types:
    google_form:
      request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
      request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"
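
To see what a request_mappings string contains, it can be decoded like any URL query string. This is a small illustrative sketch using Ruby's standard library, not ArcLight's own code; the helper name parse_request_mappings is hypothetical.

```ruby
require 'uri'

# Decode a request_mappings string into { ArcLight field => form field id }.
def parse_request_mappings(mappings)
  URI.decode_www_form(mappings).to_h
end

mappings = 'document_url=entry.1980510262&collection_name=entry.619150170&' \
           'collection_creator=entry.14428541&eadid=entry.996397105&' \
           'containers=entry.1125277048&title=entry.862815208'

parse_request_mappings(mappings).each do |arclight_field, form_field|
  puts "#{arclight_field} -> #{form_field}"
end
```

Printing the pairs this way makes it easier to compare them against the entry identifiers in a pre-filled Google Form URL.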

Configuring a repository for Aeon Web EAD requests

ArcLight repositories can be configured to make items requestable through Aeon Web EAD requests. To enable this functionality, provide the following keys for your repository in config/repositories.yml, under the aeon_web_ead entry of the request_types key:

  • request_url - the URL of the Aeon instance that will handle the request
  • request_mappings - this string represents an encoded query parameter mapping between your request and ArcLight. It can contain a method name whose return value is used as the EAD URL.

An example of a correctly configured repository looks like this:

  request_types:
    aeon_web_ead:
      request_url: 'https://sample.request.com'
      request_mappings: "Action=10&Form=31&Value=ead_url"

Indexing a single file

We can now use the arclight:index task in ArcLight to index our EAD.

FILE=./eads/alphaomegaalpha.xml REPOSITORY_ID=nlm bundle exec rake arclight:index
Loading ./eads/alphaomegaalpha.xml into index...
Indexed ./eads/alphaomegaalpha.xml (in 0.837 secs).

Adding more finding aids and repositories

You can add new repositories to the config/repositories.yml file. The key that begins a repository is the same value you will use as the REPOSITORY_ID in the indexing rake task.
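
Because the top-level keys of config/repositories.yml are the values used as REPOSITORY_ID, they can be listed with a few lines of Ruby. This is a sketch using an inline sample rather than your real file; the second repository entry and the repository_ids helper are invented for illustration.

```ruby
require 'yaml'

# Inline stand-in for config/repositories.yml. Only "nlm" comes from the
# generated file; "sul-spec" and its name are illustrative.
SAMPLE_REPOSITORIES = <<~YAML
  nlm:
    name: 'National Library of Medicine. History of Medicine Division'
  sul-spec:
    name: 'Example special collections repository'
YAML

# Returns the top-level keys, i.e. the values to use as REPOSITORY_ID.
def repository_ids(yaml_text)
  YAML.safe_load(yaml_text).keys
end

puts repository_ids(SAMPLE_REPOSITORIES)
```

To inspect a real application, pass File.read('config/repositories.yml') instead of the sample text.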

We recommend that you organize EADs by repository, placing each repository's files in a directory named after the repository's key. Then run the arclight:index_dir rake task with the DIR and REPOSITORY_ID environment variables to index all of the files in that directory into the same repository:

# this assumes there's a directory with EAD files called /tmp/sul-spec, and a repository configured with the ID "sul-spec"
DIR=/tmp/sul-spec REPOSITORY_ID=sul-spec bundle exec rake arclight:index_dir
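
If EADs are laid out as one sub-directory per repository key, the per-repository commands can be generated rather than typed. The following is a sketch under that assumption; the index_commands helper and the eads/<repository key>/ layout are illustrative, and the script only prints the commands rather than running them.

```ruby
require 'shellwords'

# Build one arclight:index_dir command per sub-directory of +root+, using
# the sub-directory name as the REPOSITORY_ID.
def index_commands(root)
  directories = Dir.glob(File.join(root, '*')).select { |path| File.directory?(path) }
  directories.sort.map do |dir|
    repository_id = File.basename(dir)
    "DIR=#{Shellwords.escape(dir)} REPOSITORY_ID=#{Shellwords.escape(repository_id)} " \
      'bundle exec rake arclight:index_dir'
  end
end

puts index_commands('eads')
```

Plain files directly under the root directory are ignored, so stray notes or archives there do not produce commands.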

Configuring Downloads for Collections

We use the config/downloads.yml file to configure how we provide download links to resources that can be generated from metadata indexed into the collection (e.g. PDF and EAD links). Accessors from the SolrDocument class can be interpolated using the Ruby string formatting syntax %{method_name} when using the template key (instead of the href key). This allows an ArcLight implementer to use existing accessors to interpolate values, or to create their own for any custom URL generation they would like (note that non-URL values will be URL escaped).

There is a default configuration that you can use to configure behavior for all collections.

default:
  pdf:
    template: http://example.com/%{unitid}.pdf
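
The interpolation described above behaves like Ruby's own named-reference string formatting. The following sketch illustrates the idea only and is not ArcLight's implementation: the interpolate_template helper is hypothetical, and form-style escaping via URI.encode_www_form_component is an assumption chosen because it turns "MS C 271" into "MS+C+271".

```ruby
require 'uri'

# Fill a downloads.yml-style template with URL-escaped values.
def interpolate_template(template, values)
  escaped = values.transform_values { |value| URI.encode_www_form_component(value.to_s) }
  format(template, escaped)
end

puts interpolate_template('http://example.com/%{unitid}.pdf', unitid: 'MS C 271')
```

In a real application the values would come from SolrDocument accessors rather than a literal hash.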

Collection specific behavior can be configured using the <unitid>. For example, if you have a Collection with the <unitid> of "MS C 271", you would provide links to the downloads and their sizes like so (note this is not using interpolation so a plain href key can be provided):

MS C 271:
  pdf:
    href: 'http://example.com/MS+C+271.pdf'
    size: '1.23MB'
  ead:
    href: 'http://example.com/MS+C+271.xml'
    size: 123456

If you need to remove links to a specific collection (or disable by default and enable for specific collections) you can set the disabled key to true. Note: the generated downloads.yml disables links by default.

MS C 271:
  disabled: true
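
One way to picture how default, collection-specific, and disabled entries interact is the lookup below. It is a sketch under stated assumptions, not ArcLight's code: it assumes a collection entry replaces the default entry entirely (no merging), and download_settings and the unitid "MS C 999" are hypothetical.

```ruby
require 'yaml'

DOWNLOADS = YAML.safe_load(<<~YAML)
  default:
    pdf:
      template: 'http://example.com/%{unitid}.pdf'
  MS C 271:
    disabled: true
YAML

# Pick the entry keyed by the collection's unitid, falling back to "default".
# Returns an empty hash when the chosen entry is disabled.
def download_settings(config, unitid)
  settings = config.fetch(unitid) { config.fetch('default', {}) }
  return {} if settings['disabled']

  settings.reject { |key, _| key == 'disabled' }
end

p download_settings(DOWNLOADS, 'MS C 271') # disabled collection
p download_settings(DOWNLOADS, 'MS C 999') # falls back to the default entry
```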

The size of the download can be hardcoded as the size key (as above), or an accessor on the solr document can be provided (as a string). For instance, if you have a #finding_aid_size method on your SolrDocument class that can return the size for a file, you can reference that and it will be used to provide the size in the download link text (it is okay to not provide a size at all).

MS C 271:
  pdf:
    template: http://example.com/%{pdf_id}.pdf
    size: finding_aid_size

There are custom values that can be interpolated into the URL as well. Currently this includes repository_id which is the key that is being used in the repositories.yml configuration for that document's repository.

Since this uses string interpolation, the accessor can return the entire URL to be provided (in this case, the URL will not be escaped as other values are).

MS C 271:
  pdf:
    template: '%{finding_aid_url}'

Advanced: Using another Solr instance

If you are using a Solr instance that is not at the default location on localhost, you can provide the SOLR_URL environment variable to index into that service:

SOLR_URL=http://solr.example.com/solr FILE=myead.xml REPOSITORY_ID=myid bundle exec rake arclight:index

Advanced: Purging your Solr instance

Normal indexing with ArcLight overwrites existing documents in the index. You may, however, want to remove all of your Solr documents if your content has changed, and then re-index your current content.

bundle exec rake arclight:destroy_index_docs
bundle exec rake arclight:index ...

Advanced: Customizing behavior of indexing components w/o IDs

While it is highly recommended that you index EAD that has consistent IDs for all components, we do mint an ID for you if we encounter a component without an ID. This can be customized in a few ways.

By default, the indexer uses something similar to an XPath to the component (but including indexes to make sure we always have a unique value for each component) and uses SHA1 to create a hexdigest of it. This digest is then added to the ID of the collection to generate the document ID (similar to other documents that have IDs).
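
The effect of hashing a position-based path can be sketched as follows. This is an illustration, not the indexer's code: the path strings, the collection ID, the minted_id helper, and the exact way the digest is joined to the collection ID are assumptions made for the example.

```ruby
require 'digest'

# Hash a position-based path and attach it to the collection ID.
def minted_id(collection_id, positional_path, algorithm: Digest::SHA1)
  "#{collection_id}#{algorithm.hexdigest(positional_path)}"
end

# The same component, before and after another component is inserted ahead
# of it, gets a different path and therefore a different ID.
before = minted_id('aoa271', 'ead/archdesc/dsc/c01_1/c02_0')
after  = minted_id('aoa271', 'ead/archdesc/dsc/c01_2/c02_0')

puts before
puts after
puts before == after
```

This is why minted IDs are stable only as long as the component hierarchy does not change.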

It's possible to use another algorithm by updating Arclight::HashAbsoluteXpath.hash_algorithm:

Arclight::HashAbsoluteXpath.hash_algorithm = Digest::SHA256

This can be any object that will respond to #hexdigest with the value to be hashed as the parameter and return the hashed value.
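
For example, a custom object only needs a hexdigest method. The sketch below defines a hypothetical ShortSha256 module that truncates a SHA-256 digest; the name and the 16-character length are illustrative choices, and shorter digests raise the chance of collisions.

```ruby
require 'digest'

# Any object responding to #hexdigest(value) can serve as the algorithm.
module ShortSha256
  def self.hexdigest(value)
    Digest::SHA256.hexdigest(value)[0, 16]
  end
end

puts ShortSha256.hexdigest('ead/archdesc/dsc/c01_1')

# In an ArcLight application it would then be assigned in the same way:
# Arclight::HashAbsoluteXpath.hash_algorithm = ShortSha256
```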

An entirely different strategy can also be used by updating Arclight::MissingIdStrategy.selected:

Arclight::MissingIdStrategy.selected = MyMissingIdStrategy

The class being used as a strategy can take the XML node as a parameter to the initializer and must return the minted ID (minus the collection ID, which will be automatically added) in response to the #to_hexdigest method.
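
A strategy class following that contract might look like the sketch below. MyMissingIdStrategy is the placeholder name used above; hashing node.path assumes the node is a Nokogiri element (whose path method returns its XPath), and the FakeNode struct exists only so the example runs without Nokogiri.

```ruby
require 'digest'

class MyMissingIdStrategy
  # node: the XML node for the component that lacks an id.
  def initialize(node)
    @node = node
  end

  # Returns the minted ID, without the collection ID.
  def to_hexdigest
    Digest::SHA256.hexdigest(@node.path)
  end
end

# Stand-in for an XML node, for demonstration only.
FakeNode = Struct.new(:path)
puts MyMissingIdStrategy.new(FakeNode.new('/ead/archdesc/dsc/c01[2]')).to_hexdigest
```

Like the default, this particular strategy still depends on the component's position; a strategy based on stable content in the node would avoid that.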

Using Traject

Traject is the newly adopted way forward for indexing content into ArcLight.

bundle exec traject -u http://127.0.0.1:8983/solr/blacklight-core -i xml -c lib/arclight/traject/ead2_config.rb spec/fixtures/ead/sample/large-components-list.xml