
iDigBio/idb-ee-exporter


Code for exporting data sets for parties interested in maintaining copies of iDigBio

EE, and other mirroring parties, want differences/incremental data. Use CSV as an interchange format, limit fields to what's in the index, and use a flat representation of the data akin to our existing download system, but pack everything into one file. Diffs/patches of CSVs are probably not useful: they help construct a whole snapshot in CSV form, when what people want to know is which operations they need to perform on their data system to update their data. (I suppose people could write a converter from a diff/patch text format to database operations, but we could be more helpful to them by providing better info.) What they likely want is records flagged as new/change/del so they can read them and apply their data update process to each. This would be useful for our own explorations too, since the only change-over-time metric we currently have is the total number of records.

File naming

  • idigbio-datetime1-ee.csv.bz2 as the base full dump at a specific time
  • idigbio-datetime1-datetime2-ee-diff.csv.bz2 as the first diff between T1 full and T2 full
  • idigbio-datetime2-ee.csv.bz2 as the base full dump at second time

Each diff can be applied in order to move the data version up to the next one. Any full copy can be used as a starting point. The datemodified field is discontinuous enough that it should be easy to tell which full snapshot you're currently on by looking at its maximum value: unless two full snapshots are identical, their maximum datemodified values will differ.
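As a rough illustration of the naming scheme above, the sketch below works out which diff files to apply, in order, starting from a given full snapshot. It assumes the datetime tokens contain no hyphens (for example 20170101T000000); the actual timestamp format is not decided here, and the function name is hypothetical.

```python
import re

# Assumption: datetime tokens contain no "-" characters.
DIFF_RE = re.compile(
    r"^idigbio-(?P<t1>[^-]+)-(?P<t2>[^-]+)-ee-diff\.csv\.bz2$"
)


def diffs_to_apply(filenames, current):
    """Return the diff file names, in order, that move snapshot `current` forward."""
    # Index every diff by the snapshot it starts from.
    by_start = {}
    for name in filenames:
        match = DIFF_RE.match(name)
        if match:
            by_start[match.group("t1")] = (match.group("t2"), name)

    # Follow the chain T1 -> T2 -> T3 ... until no further diff exists.
    chain = []
    while current in by_start:
        current, name = by_start[current]
        chain.append(name)
    return chain
```

Full dumps in the listing are ignored by the pattern, so the same directory listing can be passed in unfiltered.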

Code outline

The notebooks hold the exploration and development of the code; please review them for more details. The jobs files are the code that is actually run to produce output.

Differ

Given two full exports in parquet form generated by GUODA, generate a list of records that are different and add a column that is new/change/del to indicate the operation needed. Write out a .parquet with the differences back to the HDFS store in /guoda/outputs. Differences are determined by the presence or absence of record identifiers (iDigBio uuids) and by the last modified date of records. This means that individual field changes are not tracked and whole rows need to be replaced when updating a record.
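The comparison rule can be sketched in plain Python. This is not the job code, which works on parquet files; it is only a small stand-in that assumes each snapshot is a dict keyed by uuid, that rows carry a datemodified value, and that the operation column is called op (a hypothetical name).

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts of uuid -> row dict) and flag each difference.

    Returns whole rows with an added "op" key set to "new", "change" or "del".
    """
    ops = []
    for uuid, row in new.items():
        if uuid not in old:
            # Identifier only exists in the newer snapshot.
            ops.append({**row, "op": "new"})
        elif row["datemodified"] != old[uuid]["datemodified"]:
            # Same identifier, different last modified date: replace the whole row.
            ops.append({**row, "op": "change"})
    for uuid, row in old.items():
        if uuid not in new:
            # Identifier disappeared from the newer snapshot.
            ops.append({**row, "op": "del"})
    return ops
```

Because only uuids and datemodified are compared, a record whose fields changed without a new datemodified would not be flagged by this rule.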

Checker

Currently there's a notebook that checks diffs to see if they are realistic. There needs to be an automated diff applier that checks to make sure the diff can move you between two versions of the full dumps.
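A minimal sketch of what such an automated check might look like, again in plain Python rather than against parquet. It assumes snapshots are dicts keyed by uuid and that diff rows carry a hypothetical op column with the values new/change/del; the function names are illustrative only.

```python
def apply_diff(snapshot, diff_rows):
    """Return a new snapshot (dict of uuid -> row) with the diff rows applied."""
    result = dict(snapshot)
    for row in diff_rows:
        # Strip the operation flag so stored rows match the full-dump layout.
        record = {key: value for key, value in row.items() if key != "op"}
        if row["op"] == "del":
            result.pop(record["uuid"], None)
        else:
            # "new" and "change" both replace the whole row.
            result[record["uuid"]] = record
    return result


def diff_is_valid(old, new, diff_rows):
    """True if applying diff_rows to old reproduces new exactly."""
    return apply_diff(old, diff_rows) == new
```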

Exporter

Given an iDigBio parquet, format the data for EE and write out a CSV file. Originally we were planning on slimming down the width of the outputted CSV to the "relevant" fields for a specific purpose such as EE or mapping in general. After reflecting further, there's no reason not to try to include everything: most mirroring parties won't care much about a few GB of disk space, and fields like verbatim locality, which account for a lot of the size, are relevant to mapping.

CSVs are written back to HDFS. Both full dumps and diffs will be written back although it likely makes sense to only write occasional full dumps in CSV format. We are keeping all full dumps in parquet format already.
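For illustration, writing one flat, bzip2-compressed CSV can be sketched with the Python standard library. This writes to a local path rather than HDFS and is not the exporter job itself; the function name and the choice to leave missing fields empty are assumptions.

```python
import bz2
import csv


def write_csv_bz2(path, rows, fieldnames):
    """Write row dicts to a bzip2-compressed CSV with a single header line."""
    # newline="" lets the csv module control line endings itself.
    with bz2.open(path, "wt", encoding="utf-8", newline="") as handle:
        # restval="" leaves fields that a row does not have as empty cells.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
```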

Publisher

Write the given CSV from HDFS into Ceph. The idigbio-static-downloads bucket or the idigbio-guoda-prod bucket are the best places for these to go. The idigbio-guoda-prod bucket is probably better, since these datasets are 4 steps removed from the canonical data store, so they are derivatives of iDigBio and less a product of it.

Consider providing an RSS feed of diffs; the publisher should update it whenever a new diff is published.

Attribution

How do people using EE know who is responsible for providing the data they're looking at?

Data documentation

Need to talk about changes to column layout - the differ will seamlessly union different structures, but CSV exports will have different headers that other tools need to deal with.
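One way a consuming tool might cope with headers that change between exports is to build the union of the column lists and pad missing values. The sketch below is only an illustration of that idea; the function names and the empty-string fill value are assumptions, not part of the exporter.

```python
def union_columns(*headers):
    """Merge several header lists into one, keeping first-seen order."""
    merged = []
    for header in headers:
        for column in header:
            if column not in merged:
                merged.append(column)
    return merged


def align_row(row, columns):
    """Return the row with every column present, filling gaps with ""."""
    return {column: row.get(column, "") for column in columns}
```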

About

Code for exporting data sets for Earth Engine
