
Curation Manual Wiki

David Osumi-Sutherland edited this page Sep 30, 2020 · 8 revisions

VFB curation manual wiki

A home for VFB curation guidelines and SOPs.

Repository Overview:

/records
   relations_spec.yaml  # Specification of relations that are legal to use in curation

   new_datasets/ # Curation records for adding new datasets
      ds_spec.yaml # Specification of fields used in dataset curation
      working/    # Records here are checked for syntax only
      to_submit/  # Records here are fully checked and loaded to a test DB.  
                   A Jenkins job is used to load passing records from here into the KB.

   new_images/ # Curation records for adding new images
      common_fields_spec.yaml # Specification of fields that may be used in all new_image curation
      anat_spec.yaml # Specification of fields for new anatomy image curation
      split_spec.yaml # Specification of fields for new split image curation
      ep_spec.yaml # Specification of fields for new expression pattern image curation
      working/  # Records here are checked for syntax only
      to_submit/ # Records here are fully checked and loaded to a test DB.  
                   A Jenkins job is used to load passing records from here into the KB.

   new_metadata/ # Curation records for adding new metadata to existing images
      common_fields_spec.yaml
      newmeta_spec.yaml
      working/  # Records here are checked for syntax only
      to_submit/ # Records here are fully checked and loaded to a test DB.  
                   A Jenkins job is used to load passing records from here into the KB.

   archive/    # Archive submitted records here

Pipeline tracking and reporting:

We run a number of pipelines from external data sources. We track the progress of curation through these pipelines via reports generated nightly on VFB_reporting_results. See the accompanying README.md for details of file contents.

Release cycle:

Curation and images are staged for public release on our staging servers, following ad hoc^ runs of the VFB pipeline. Following pipeline runs, new content should be searchable, queryable and browsable on v2a.virtualflybrain.org.

^ This may move to a regular cycle in the near future.

Curation file and ticket/card naming convention

Curation progress is tracked via the DataSet staging board.

Curation files are named: {Type}_{DataSetName}_YYMMDD e.g. Anat_Berck2016_191015.

  • The type prefix is required because the parser uses it to determine how to process the file.
  • DataSetName is required in order to attach curation to the correct dataset.
  • YYMMDD is required because there may be more than one curation record per dataset.

Types:

  • Expression pattern (ep): Used to load new (single driver) expression pattern images.
  • Split (split): Used to load new split expression pattern images.
  • Anatomy (anat): Used to load new anatomical images (e.g. a neuron o
  • New Metadata (newmeta): Used to extend annotation on existing images.
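As an illustration of the naming convention above, here is a minimal Python sketch that splits a curation file name into its three parts. It is not the VFB parser: the function name `parse_curation_filename`, the case-insensitive matching of the type prefix, and the optional .tsv/.yaml extension handling are all assumptions made for this example.

```python
import re
from datetime import datetime

# Type prefixes listed in this manual. Matching them case-insensitively
# (so that "Anat" and "anat" are both accepted) is an assumption.
KNOWN_TYPES = {"ep", "split", "anat", "newmeta"}

# {Type}_{DataSetName}_YYMMDD, optionally followed by .tsv or .yaml
_NAME_RE = re.compile(
    r"^(?P<type>[A-Za-z]+)_(?P<dataset>.+)_(?P<date>\d{6})(?:\.(?:tsv|yaml))?$"
)


def parse_curation_filename(name):
    """Hypothetical helper: split a curation file name into type, dataset and date."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"Name does not follow Type_DataSetName_YYMMDD: {name!r}")
    record_type = match.group("type").lower()
    if record_type not in KNOWN_TYPES:
        raise ValueError(f"Unknown curation type prefix: {match.group('type')!r}")
    # strptime raises ValueError if YYMMDD is not a real calendar date
    date = datetime.strptime(match.group("date"), "%y%m%d").date()
    return {"type": record_type, "dataset": match.group("dataset"), "date": date}
```

For the example name used above, `parse_curation_filename("Anat_Berck2016_191015")` yields the type `anat`, the dataset `Berck2016` and the date 15 October 2019.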

Curation cards/tickets on the board follow a similar naming convention:

Card/ticket Name | Example | Description | Project board SOP
DS: {Source} {DataSetName} | DS: L1EM Berck2016 | DataSet Epic | In DataSets column until subtasks complete
Images: {Source} {DataSetName} | Images: L1EM Berck2016 | Image loading task for DataSet | Move through sprint columns
Curation: {Source} {curation filename} | Curation: L1EM Berck2016_191015 | Curation task for DataSet | Move through sprint columns
Anatomy: {Source} {DataSetName} | Anatomy: L1EM Berck2016 | Ontology task for DataSet | Card with link to FBbt ticket. Move through sprint columns
Features: {Source} {DataSetName} | Features: FlyLight Ito2015 | Feature curation task for DataSet | Move through sprint columns

Sources here are large-scale projects/data providers: FlyLight; FlyCircuit; L1EM; FAFB; FlyEM...

DataSet naming convention:

  • short_form = surname of first author + year, e.g. Berck2016. Where this would result in multiple datasets with the same name, extend with a single lower-case letter (a, b, c etc.) as needed.
  • DataSet label = a longer name that is descriptive of DataSet contents. Guidelines and examples TBA
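The short_form rule can be sketched in a few lines of Python. This is only an illustration: `make_short_form` is a hypothetical helper, and the choice to leave the first dataset unsuffixed and give only later clashes a letter is an assumption, since the convention above does not say which dataset receives the suffix.

```python
import string


def make_short_form(first_author_surname, year, existing):
    """Hypothetical helper: build a DataSet short_form that is unique within `existing`.

    Assumption: the first dataset keeps the bare surname + year, and later
    clashes receive the suffixes a, b, c, ... in order.
    """
    base = f"{first_author_surname}{year}"
    if base not in existing:
        return base
    for letter in string.ascii_lowercase:
        candidate = base + letter
        if candidate not in existing:
            return candidate
    raise ValueError(f"No single-letter suffix left for {base}")
```

With `existing = {"Berck2016"}`, the call `make_short_form("Berck", 2016, existing)` returns `Berck2016a`.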

Curation record overview:

Warning - this is an overview and may be out of date. For the latest spec please see YAML spec files (linked below).

Curation files are plain .tsv (unquoted) or .yaml files. All fields may be specified in a .tsv file, but some may optionally be specified in an accompanying .yaml file instead. This is useful for fields whose content applies to all rows in a .tsv file, e.g. for images this might include dataset, imaging_type and template (see below for an example). Any accompanying .yaml file must have the same name as the .tsv file, apart from the extension (.tsv/.yaml).
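The relationship between a .tsv file and its accompanying .yaml file can be sketched as follows. This is a minimal illustration and not the loader's actual code: the helper names are hypothetical, the common fields are passed in as a plain dict (as they might look after the .yaml file has been read), and letting a non-empty value in a row win over the common value is an assumption.

```python
import csv
import io
from pathlib import Path


def companion_yaml_path(tsv_path):
    """Same file name as the .tsv file, with the extension swapped to .yaml."""
    return Path(tsv_path).with_suffix(".yaml")


def merge_common_fields(tsv_text, common_fields):
    """Hypothetical helper: apply common fields to every row of an unquoted TSV.

    Assumption: a non-empty value in the row takes precedence over the
    common value from the accompanying .yaml file.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    merged = []
    for row in reader:
        record = dict(common_fields)  # start from the shared values
        record.update({key: value for key, value in row.items() if value})
        merged.append(record)
    return merged
```

The column names in any TSV passed to `merge_common_fields` are whatever the relevant spec defines; the sketch does not check them.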

Within curation files, ontology terms and FlyBase features are all specified by name (see below for details of how to cope with special characters). Where fields take multiple entries, these are separated by a '|'. VFB individuals (the structures depicted in images) are specified by internal VFB ID or external DB ID. DataSets are referred to by their short_form (e.g. Berck2016).
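A multi-entry field can be read with a simple split on '|'. The sketch below is illustrative only: `split_multi` is a hypothetical name, trimming whitespace around each entry is an assumption, and no escaping of special characters is attempted.

```python
def split_multi(value):
    """Hypothetical helper: split a '|'-separated field into a list of entries.

    Assumption: surrounding whitespace is not significant and empty
    entries are dropped.
    """
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]
```

An empty or missing field gives an empty list, so callers can treat single-valued and multi-valued fields uniformly.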

DataSet curation

YAML spec; Example - tba

Image curation

Curation files for common fields

Some fields are common to all images in a dataset, so specifying them individually for all rows in a data file would be inefficient. We specify these in simple YAML files.

e.g.

dataset: Berck2016
template: L1EM
imaging_type: TEM
curators: [CP, DOS, RC]  # Need convention for this - all are converted to orcids in DB

Curation record types for extending annotation on existing images:

STATUS: This works - please try it!

NewMetadata (YAML spec; Example TSV; Example YAML):

Add new metadata by specifying relationships: subject, object, relation & optional comment/pub with evidence for relationship. subject may be referred to by VFB id, or using some external ID. Relation and object are referred to by name. Relation must be one specified in relations_spec.yaml.

  • subject_external_db: VFB DataBase ID for external DB
  • subject_external_id: External ID for subject in database referred to
  • subject_id: VFB ID of subject
  • subject_name: Optionally provide subject name for cross-check
  • relation: The relation must be either is_a or one of a standard set agreed for curation - see relations_spec.yaml
  • object: The name of an ontology term (typically FBbt) or a FlyBase feature - see relations_spec.yaml.
  • object_external_db: VFB DataBase ID for external DB
  • object_external_id: External ID for object in database referred to
  • ind_object_id: VFB ID of object
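The subject and relation rules described above could be checked along the following lines. This is a sketch under stated assumptions, not the validation the pipeline performs: `check_newmeta_row` is a hypothetical function, and the set of allowed relations is supplied by the caller because the structure of relations_spec.yaml is not described here.

```python
def check_newmeta_row(row, allowed_relations):
    """Hypothetical check of one NewMetadata row; returns a list of error strings.

    `allowed_relations` stands in for the relations read from
    relations_spec.yaml (how that file is parsed is not shown here).
    """
    errors = []

    # The subject is given either by VFB ID or by an external DB + external ID pair.
    has_vfb_id = bool(row.get("subject_id"))
    has_xref = bool(row.get("subject_external_db")) and bool(
        row.get("subject_external_id")
    )
    if not (has_vfb_id or has_xref):
        errors.append(
            "subject: need subject_id, or subject_external_db + subject_external_id"
        )

    # The relation must be is_a or one of the agreed curation relations.
    relation = row.get("relation")
    if relation != "is_a" and relation not in allowed_relations:
        errors.append(f"relation: {relation!r} is not is_a or an allowed relation")

    return errors
```

An empty list means the row passed these two checks; object fields are not examined in this sketch.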

Options for specifying object:

  • specify an individual object with either an xref (object_external_db + object_external_id) or an id (ind_object_id) + name (object) used for checking, OR
  • specify a type object with an FBbt name field only

Order of precedence: an xref overrides a VFB ID. Both override interpretation of the object field as a type name.
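The order of precedence can be expressed as a short chain of checks. The function below is a minimal sketch: the name `resolve_object` and the returned labels are hypothetical, and only the field names come from the list above.

```python
def resolve_object(record):
    """Hypothetical helper: decide how the object of a NewMetadata row is specified.

    Precedence: xref first, then VFB ID, then the object field read as a type name.
    Returns a (kind, value) tuple; the kind labels are illustrative only.
    """
    external_db = record.get("object_external_db")
    external_id = record.get("object_external_id")
    vfb_id = record.get("ind_object_id")
    name = record.get("object")

    if external_db and external_id:
        # An xref wins even if a VFB ID is also present.
        return ("individual_by_xref", (external_db, external_id))
    if vfb_id:
        # With a VFB ID, the object field is only a name used for checking.
        return ("individual_by_vfb_id", vfb_id)
    if name:
        # With neither, the object field is read as a type name.
        return ("type_by_name", name)
    raise ValueError("No object specified")
```

A row that fills in both an xref and ind_object_id is therefore resolved through the xref, matching the stated precedence.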

Curation record types for adding new images:

STATUS: Development still in progress

(common_fields.yaml):

  • Expression pattern (YAML spec; Example - TBA): Specify a driver using a FlyBase feature name. Submission of this curation record will create the expression pattern node if it does not already exist.
  • Split (YAML spec; Example - TBA): Specify AD and DBD using FlyBase feature names. Submission of this curation record will create the appropriate split expression pattern node if it does not already exist.
  • Anatomy (YAML spec; Example): Specify Classification (IS_A) and reasons for classification; Optionally specify a driver.

Image (depicted entity) naming convention

TBA.

Some notes:

  • Preserving original names of entities is essential.
  • Sometimes there are no clear original names at the individual level - only for classes. In these cases we need to make the names unique, consistent and informative. The simplest way to do this is to name for type + dataset + some number if needed for uniqueness.