Add code for processing datasets from Popler #95

Open
ha0ye opened this issue Feb 28, 2019 · 7 comments

@ha0ye
Member

ha0ye commented Feb 28, 2019

Popler is a package for obtaining LTER datasets in a (somewhat) standardized way. We are going to need code that processes the data into the format MATSS needs:

Obtaining the data files

  • First, identify the datasets that match our needs, by specifying the arguments for pplr_browse(...)
  • Next, get the raw data for a particular dataset with pplr_get_data(...)

Metadata

  • identify the columns in data that do not change across observations - these are likely to be values that will go in the metadata list
  • from the community metadata (output of pplr_browse?), extract the species table to add to metadata
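
A minimal sketch of the first metadata step, in base R only. The toy data frame and its column names are hypothetical stand-ins for the raw data returned for one dataset; the real column names may differ.

```r
# Toy stand-in for one raw dataset; all column names are hypothetical.
raw <- data.frame(
  proj_metadata_key = c(1, 1, 1, 1),
  lterid            = c("SEV", "SEV", "SEV", "SEV"),
  year              = c(2001, 2001, 2002, 2002),
  species           = c("A", "B", "A", "B"),
  count             = c(3, 5, 2, 7),
  stringsAsFactors  = FALSE
)

# A column is "constant" if it holds exactly one distinct value across rows.
is_constant <- vapply(raw, function(col) length(unique(col)) == 1, logical(1))

# Constant columns collapse to a named list of candidate metadata values.
metadata <- as.list(raw[1, is_constant, drop = FALSE])

# The remaining columns vary by observation and stay in the data table.
observations <- raw[, !is_constant, drop = FALSE]
```

Here `metadata` ends up holding `proj_metadata_key` and `lterid`, while `observations` keeps `year`, `species`, and `count`. A column that is constant only by coincidence (e.g. a single sampling year) would also be swept into the metadata, so the result probably needs a manual check per dataset.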

Covariates

  • need to parse the right things to construct a time index column, and attach the name to metadata
  • use pplr_cov_unpack(data) and munge the name and value columns into the covariates table
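
One possible way to build the time index column, sketched in base R. It assumes the raw data has numeric `year`, `month`, and `day` columns (hypothetical names), and it assumes the name of the time column is stored in a `timename` metadata field; both should be checked against the actual data and the MATSS format.

```r
# Toy stand-in for the date-related columns; column names are assumptions.
raw <- data.frame(
  year  = c(2001, 2001, 2002),
  month = c(5, 6, NA),
  day   = c(14, NA, NA)
)

# Fill missing month/day with 1 so every row yields a valid date.
# This is a choice made for the sketch; for datasets sampled only yearly,
# using the year itself as the time index may be more honest.
month <- ifelse(is.na(raw$month), 1, raw$month)
day   <- ifelse(is.na(raw$day), 1, raw$day)

covariates <- data.frame(
  date = as.Date(sprintf("%04d-%02d-%02d",
                         as.integer(raw$year),
                         as.integer(month),
                         as.integer(day)))
)

# Record which covariates column is the time index (field name assumed).
metadata <- list(timename = "date")
```

The covariates unpacked from the raw data could then be bound onto this `covariates` table column by column.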
@ha0ye
Member Author

ha0ye commented Feb 28, 2019

@bleds22e has been working with the popler package, and will make a new branch for this work (so that everyone can help out).

@ha0ye
Member Author

ha0ye commented Mar 21, 2019

(currently on hold; see #101)
resolved

@ha0ye ha0ye added this to the Spring Semester Goals milestone Mar 21, 2019
@ha0ye ha0ye added the dataset adding new data to MATSS label Mar 21, 2019
@diazrenata
Member

Currently, the next steps are:

  • Figure out a permanent storage solution
  • Functions to process and add to MATSS

@ha0ye ha0ye mentioned this issue Apr 23, 2019
15 tasks
@ha0ye
Member Author

ha0ye commented May 28, 2019

Update on Popler data integration:
• the LTER sites that are included in Popler's database each have their site-specific data transformed into Popler's format
• this can contain mixtures of different data sampling schemes, so generating community time series data is non-trivial

thoughts on ways forward:
• contact LTER data managers for already-prepared time series datasets
• manually clean and transform each of (many) datasets ourselves
• see if Popler has information on the backend about the different types of datasets it's pulling in from each LTER; this might let us filter for time series data more quickly (contact Aldo for this?)
• what is the overlap of datasets with BioTime? (is it easier to try and get these datasets from BioTime?)

current status:
• we are compiling some summary tables on how the different LTER sites have their data organized hierarchically within Popler (Popler calls these "spatial replication levels")
• https://github.com/ha0ye/popler will eventually contain generated Rmarkdown reports for these summary tables (one report for each LTER dataset entry), to be uploaded once they have all been generated

😵 😫

@ha0ye
Member Author

ha0ye commented May 29, 2019

@diazrenata also suggested we could do some digging through the source code for popler to see if that yielded any clues about how it might be processing data on its end.

@ha0ye
Member Author

ha0ye commented Jun 20, 2019

> @diazrenata also suggested we could do some digging through the source code for popler to see if that yielded any clues about how it might be processing data on its end.

It sounds like there may be unique code for importing each dataset into popler, so this may not be a feasible path to lessen the workload of manually dealing with each dataset.

I think our steps forward are:

  • don't worry about replication levels too much, and just check for the names of the spatial replication level variables -- if there are a lot of high-level ones named "site", that may be a good thing to use to split datasets in Popler into separate communities. Otherwise, just assemble the communities as is (i.e. aggregate over other spatial replication levels).
    AND/OR
  • use BioTime if the Popler datasets also appear in BioTime. Given the relative sizes of the databases, that seems unlikely, and Popler likely has data that are not in BioTime. Though maybe BioTime has pre-aggregated data that are more easily assembled into time series (i.e. abundances instead of a row for each raw count)
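
The first option above (split on a top-level "site" variable, aggregate over everything below it) might look roughly like this in base R. The data frame and the column names `site`, `plot`, `year`, `species`, and `count` are hypothetical; the actual spatial replication columns in Popler's output are named differently and would need to be mapped first.

```r
# Toy raw counts: one row per observation; all column names are hypothetical.
raw <- data.frame(
  site    = c("north", "north", "north", "south", "south", "north"),
  plot    = c("p1", "p2", "p1", "p1", "p1", "p2"),
  year    = c(2001, 2001, 2002, 2001, 2001, 2002),
  species = c("A", "A", "B", "A", "B", "A"),
  count   = c(3, 4, 2, 5, 1, 6),
  stringsAsFactors = FALSE
)

build_community <- function(d) {
  # Sum counts within each year x species cell, pooling over plots
  # (i.e. aggregating over the lower spatial replication levels).
  abundance <- xtabs(count ~ year + species, data = d)
  # Convert the 2-d table to a data frame: rows are years, columns species.
  as.data.frame.matrix(abundance)
}

# One community (year x species abundance table) per top-level site.
communities <- lapply(split(raw, raw$site), build_community)
```

Each element of `communities` is a year-by-species table, with zeros where a species was not recorded in a year. Whether summing across plots is appropriate depends on sampling effort being comparable across years, which this sketch does not check.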

@ha0ye
Member Author

ha0ye commented Jan 17, 2020

This issue needs a decision one way or another (i.e. whether to try and include US LTER data via Popler for V1).
