Possible simple(ish) approach to nest detection pipeline #128

Open
ethanwhite opened this issue Jan 15, 2023 · 1 comment
@ethanwhite
Member

I’ve been pondering how to handle the need to treat different flights differently when doing image processing:

  1. For flights that end with _B, _C, etc. only do bird prediction and mapbox creation
  2. For flights with no GCP do everything except nest detection
  3. For flights that end with nothing or _A and have GCP do everything
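The three cases above could be captured in one small helper. This is only a sketch under assumptions: the function name `processing_mode`, the three mode labels, and the example flight names are hypothetical, and it assumes a repeat flight is marked by a trailing underscore plus a single capital letter. Case 2 is read here as applying to primary flights (no suffix or `_A`) that lack GCP.

```python
import re

# Hypothetical labels for the three processing modes listed above.
BIRDS_AND_MAPBOX_ONLY = "birds_and_mapbox_only"
ALL_EXCEPT_NESTS = "all_except_nests"
EVERYTHING = "everything"

# Assumption: a repeat flight ends in "_" plus one capital letter (e.g. "_B");
# a primary flight ends with the year (no suffix) or with "_A".
_SUFFIX = re.compile(r"_([A-Z])$")


def processing_mode(flight_name: str, has_gcp: bool) -> str:
    """Return which of the three processing modes applies to a flight."""
    match = _SUFFIX.search(flight_name)
    suffix = match.group(1) if match else None
    if suffix not in (None, "A"):
        # Case 1: _B, _C, etc. get bird prediction and mapbox creation only.
        return BIRDS_AND_MAPBOX_ONLY
    if not has_gcp:
        # Case 2: no GCP, so skip nest detection.
        return ALL_EXCEPT_NESTS
    # Case 3: primary flight with GCP.
    return EVERYTHING
```

For example, `processing_mode("ExampleSite_03_15_2022_B", True)` returns `BIRDS_AND_MAPBOX_ONLY`, while the same name without the `_B` suffix returns `EVERYTHING`.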

In the process I think I may have also stumbled across a clean solution for the complexities of Snakemake and the nest detection part of the pipeline.

Proposal:

  1. Add a new output to the predict_birds rule: the file "/blue/ewhite/everglades/predictions/{year}/{site}/nest_flights_info.txt".
  2. This file will contain a set of paths for individual flights of the appropriate year and site, as well as timestamps for when predict.py was last run on them (building on Henry’s approach).
  3. Add a new function to predict.py that reads in nest_flights_info.txt and updates it for the file being processed, either by updating the timestamp if that file is already listed or by adding the file if it hasn’t been previously processed. The timestamp won’t be used; it’s only being changed to ensure that the file itself is changed (and to provide useful information for debugging/tracking builds by humans).
  4. This function only does this for files that meet the requirements for nest detection. So the function checks that the path ends with either just the year or _A, and it also checks the year and site against a lookup table to determine whether GCPs are present. If either of those conditions is false then the function just returns and never touches nest_flights_info.txt.
  5. The input for the detect_nests rule is then "/blue/ewhite/everglades/predictions/{year}/{site}/nest_flights_info.txt", and we update load_files() to use the files listed in nest_flights_info.txt instead of globbing for .shp files. The output is "/blue/ewhite/everglades/detected_nests/{site}_{year}_detected_nests.shp". If nest_flights_info.txt is empty (as it will be for sites with no GCP), then this is an empty shapefile.
  6. The input for the process_nests rule is "/blue/ewhite/everglades/detected_nests/{site}_{year}_detected_nests.shp" and the output is "/blue/ewhite/everglades/processed_nests/{site}_{year}_processed_nests.shp". If the input is an empty shapefile then the output is an empty shapefile (this may happen by default but might also need a quick conditional check).
  7. The combine_nests rule then takes
     expand("/blue/ewhite/everglades/processed_nests/{site}_{year}_processed_nests.shp",
                  zip, site=ORTHOMOSAICS.site, year = ORTHOMOSAICS.year)
    
    as the input.
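Steps 2 to 5 could be sketched roughly as follows. This is not the actual predict.py code: the function names, the tab-separated "path, timestamp" line format, and the `eligible` flag (standing in for the suffix and GCP lookup checks in step 4) are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path


def read_nest_flights_info(info_file):
    """Return {flight_path: timestamp} from the info file (empty if missing)."""
    info_file = Path(info_file)
    entries = {}
    if info_file.exists():
        for line in info_file.read_text().splitlines():
            if line.strip():
                # Assumed format: "<flight path>\t<ISO timestamp>" per line.
                path, stamp = line.split("\t", 1)
                entries[path] = stamp
    return entries


def update_nest_flights_info(info_file, flight_path, eligible):
    """Record that prediction just ran on flight_path, if it is eligible.

    Returns True if the info file was rewritten, False if left untouched.
    """
    if not eligible:
        # Step 4: flights that fail the checks never touch the file.
        return False
    info_file = Path(info_file)
    entries = read_nest_flights_info(info_file)
    # Step 3: add the flight, or refresh its timestamp if already listed.
    entries[str(flight_path)] = datetime.now(timezone.utc).isoformat()
    info_file.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"{path}\t{stamp}\n" for path, stamp in sorted(entries.items())]
    info_file.write_text("".join(lines))
    return True
```

A load_files() replacement could then iterate over the keys of `read_nest_flights_info(...)` instead of globbing. One detail this sketch leaves open: Snakemake expects a rule's declared outputs to exist once the rule finishes, so for site-years where no flight is eligible an empty nest_flights_info.txt would presumably still need to be created somewhere.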

I think this preserves the full input-output chain and should trigger appropriate rebuilds. It does this by having the output converge to a single file per site-year combination where appropriate (at the nest_detection step). That file (predictions/{year}/{site}/nest_flights_info.txt) should only get updated if one of its inputs changes (i.e., a new flight is added or an existing flight is changed). Therefore the steps further down the pipeline should only be run when they need to be.

That said, I’ve thought I had solutions to all of this in the past and been wrong. So @henrykironde - take a look at this, see if you think it will actually work, and if so decide whether you think this approach has enough benefits in terms of simplicity (i.e., we don’t have to handle tracking file changes ourselves) to go in this direction vs. the more hands-on approach you've been working on (at least as I understood it last).

@ethanwhite
Member Author

This didn't work, but it led to #93, which does. The ideas in (4) haven't been implemented yet, so that should be moved to a new issue and done. The key is that every real site-year combination does need a detected_nests file. It should just be empty if a site isn't being used for nest detection.
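To make that requirement concrete, here is a small plain-Python sketch of the bookkeeping, not the code from #93. The helper name `detected_nests_targets` and the `nest_site_years` set are hypothetical; the path template is the one from the proposal, and sites and years are paired positionally in the same way `expand(..., zip, ...)` pairs them.

```python
DETECTED_NESTS = (
    "/blue/ewhite/everglades/detected_nests/{site}_{year}_detected_nests.shp"
)


def detected_nests_targets(sites, years, nest_site_years):
    """Map every expected detected_nests path to whether it should hold nests.

    sites and years are paired positionally. nest_site_years is a
    (hypothetical) set of (site, year) pairs used for nest detection;
    every other pair still gets a path, flagged False, meaning the file
    should exist but be empty.
    """
    targets = {}
    for site, year in zip(sites, years):
        path = DETECTED_NESTS.format(site=site, year=year)
        targets[path] = (site, year) in nest_site_years
    return targets
```

Every site-year pair appears in the result, so downstream rules always have an input file; the boolean only says whether that file is expected to be empty.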
