Code for exporting data sets for parties interested in maintaining copies of iDigBio

EE and other mirroring parties want differences (incremental data). We use CSV as the interchange format, limit fields to what is in the index, and use a flat representation of the data akin to our existing download system, but pack everything into one file. Diffs or patches of the CSVs themselves are probably not useful: they help reconstruct a whole snapshot in CSV form, when what people want to know is which operations to perform on their own data system to bring it up to date. (People could write a converter from a diff/patch text format to database operations, but we can be more helpful by providing better information.) What they likely want is records flagged as new/change/del so they can read each one and apply their own update process to it. This would also be useful for our own explorations, since the only change-over-time metric we currently have is the total number of records.

File naming

idigbio-datetime1-ee.csv.bz2 as the base full dump at a specific time
idigbio-datetime1-datetime2-ee-diff.csv.bz2 as the first diff between T1 full and T2 full
idigbio-datetime2-ee.csv.bz2 as the base full dump at second time

Each diff can be applied in order to move the data version up to the next one. Any full copy can be used as a starting point. The datemodified field is discontinuous enough that it should be easy to tell which full snapshot you are currently on by looking at its maximum value: unless two full snapshots are identical, their maximum datemodified values will differ.
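The naming scheme and the "apply diffs in order" rule can be sketched as a pair of small helpers. This is an illustrative sketch only: the function names and the timestamp layout in `STAMP` are assumptions, since the scheme above only says "datetime" without fixing a format.

```python
from datetime import datetime

# Assumed timestamp layout; the naming scheme does not specify one.
STAMP = "%Y%m%dT%H%M%S"


def full_name(t):
    """File name of the full dump taken at time t."""
    return f"idigbio-{t.strftime(STAMP)}-ee.csv.bz2"


def diff_name(t1, t2):
    """File name of the diff that moves the full dump at t1 to the one at t2."""
    return f"idigbio-{t1.strftime(STAMP)}-{t2.strftime(STAMP)}-ee-diff.csv.bz2"


def diff_chain(snapshot_times, start):
    """Diff file names to apply, in order, to go from `start` to the newest snapshot."""
    later = sorted(t for t in snapshot_times if t >= start)
    return [diff_name(a, b) for a, b in zip(later, later[1:])]
```

A mirror holding the full dump for some time T would call `diff_chain` with the list of published snapshot times and T, then fetch and apply the returned files one after another.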

Code outline

The notebooks record the exploration and development of the code; please review them for more details. The job files are the code that is actually run to produce output.

Differ

Given two full exports in Parquet form generated by GUODA, generate a list of records that differ and add a column with the value new/change/del to indicate the operation needed. Write a .parquet file with the differences back to the HDFS store in /guoda/outputs. Differences are determined by the presence or absence of record identifiers (iDigBio uuids) and by the last modified date of each record. This means that individual field changes are not tracked, and whole rows need to be replaced when updating a record.
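The comparison rule can be illustrated with a small in-memory sketch in plain Python. It is not taken from differ.py and does not touch Parquet or HDFS; records are plain dictionaries, and the key names `uuid` and `op` are assumptions made for the example (`datemodified` is the field named earlier).

```python
def diff_snapshots(old_rows, new_rows):
    """Return the rows needed to move old_rows to new_rows, each tagged with 'op'.

    Rows are dicts with at least 'uuid' and 'datemodified' (assumed key names).
    """
    old = {r["uuid"]: r for r in old_rows}
    new = {r["uuid"]: r for r in new_rows}
    diff = []
    for uuid, rec in new.items():
        if uuid not in old:
            # Identifier only exists in the newer snapshot.
            diff.append({**rec, "op": "new"})
        elif rec["datemodified"] != old[uuid]["datemodified"]:
            # Same identifier, different modification date: replace the whole row.
            diff.append({**rec, "op": "change"})
    for uuid, rec in old.items():
        if uuid not in new:
            # Identifier disappeared from the newer snapshot.
            diff.append({**rec, "op": "del"})
    return diff
```

Because only the modification date is compared, a changed row carries the complete new record rather than a list of changed fields.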

Checker

Currently there's a notebook that checks diffs to see if they are realistic. There needs to be an automated diff applier that checks to make sure the diff can move you between two versions of the full dumps.
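One possible shape for the automated diff applier is sketched below. It is a hypothetical in-memory version, not existing code from this repository: rows are dictionaries, and the key names `uuid` and `op` are assumptions.

```python
def apply_diff(snapshot_rows, diff_rows):
    """Apply new/change/del rows to a snapshot; returns a dict keyed by uuid."""
    result = {r["uuid"]: r for r in snapshot_rows}
    for row in diff_rows:
        # Strip the operation flag so the stored record matches a snapshot row.
        rec = {k: v for k, v in row.items() if k != "op"}
        if row["op"] == "del":
            result.pop(rec["uuid"], None)
        elif row["op"] in ("new", "change"):
            result[rec["uuid"]] = rec  # whole-row replacement
        else:
            raise ValueError(f"unknown operation: {row['op']!r}")
    return result


def diff_is_valid(old_rows, diff_rows, new_rows):
    """True if applying the diff to the old snapshot reproduces the new one."""
    return apply_diff(old_rows, diff_rows) == {r["uuid"]: r for r in new_rows}
```

Running `diff_is_valid` on each consecutive pair of full dumps and the diff between them would confirm that the published diff really moves a mirror from one version to the next.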

Exporter

Given an iDigBio Parquet file, format the data for EE and write out a CSV file. Originally we planned to slim down the width of the output CSV to the fields "relevant" to a specific purpose, such as EE or mapping in general. On further reflection, there is no reason not to include everything: most mirroring parties won't care much about a few GB of disk space, and fields like verbatim locality, which account for much of the size, are relevant to mapping.

CSVs are written back to HDFS. Both full dumps and diffs will be written back although it likely makes sense to only write occasional full dumps in CSV format. We are keeping all full dumps in parquet format already.
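For the final serialization step, a minimal sketch using only the Python standard library is shown below. It assumes rows are already available as dictionaries and writes to any path or binary file object; it does not reproduce exporter.py and does not read Parquet or talk to HDFS. The helper names are hypothetical.

```python
import bz2
import csv
import io


def write_csv_bz2(rows, fieldnames, target):
    """Write dict rows as bzip2-compressed CSV to a path or binary file object."""
    with bz2.open(target, "wt", newline="", encoding="utf-8") as fh:
        # restval fills columns a row lacks; unexpected extra keys raise ValueError.
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)


def read_csv_bz2(source):
    """Read a bzip2-compressed CSV back into a list of dicts."""
    with bz2.open(source, "rt", newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


# Round trip through an in-memory buffer instead of a real file.
buf = io.BytesIO()
write_csv_bz2(
    [{"uuid": "a", "datemodified": "2017-01-01", "op": "new"}],
    ["uuid", "datemodified", "op"],
    buf,
)
buf.seek(0)
rows_back = read_csv_bz2(buf)
```

Opening the file with `newline=""` lets the csv module handle line endings itself, which matters for free-text fields that may contain embedded newlines.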

Publisher

Write the given CSV from HDFS into Ceph. The idigbio-static-downloads bucket and the idigbio-guoda-prod bucket are the best places for these files to go. The idigbio-guoda-prod bucket is probably the better choice, since these datasets are four steps removed from the canonical data store, which makes them derivatives of iDigBio rather than a product of it.

Consider publishing an RSS feed of diffs; the publisher should update it.

Attribution

How do people using EE know who is responsible for providing the data they're looking at?

Data documentation

Need to talk about changes to column layout: the differ will seamlessly union different structures, but CSV exports will have different headers that other tools need to deal with.
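One way a consuming tool could cope with changing headers is to compute the union of the columns it has seen and conform every row to that layout. The helpers below are a hypothetical sketch of that idea, not part of this repository; filling missing columns with an empty string is an assumption.

```python
def union_columns(*headers):
    """Union of several header lists, keeping first-seen order."""
    seen = {}
    for header in headers:
        for column in header:
            seen.setdefault(column, None)
    return list(seen)


def conform(row, columns, fill=""):
    """Project a row onto `columns`, filling columns the row lacks."""
    return {c: row.get(c, fill) for c in columns}
```

For example, a mirror that loaded an older export with one header and then receives a diff with an extra column can call `union_columns` on both headers and pass each row through `conform` before loading it.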
