I’ve been pondering how to handle the need to treat different flights differently when doing image processing:
For flights that end with _B, _C, etc., only do bird prediction and mapbox creation
For flights with no GCP, do everything except nest detection
For flights that end with nothing or _A and have GCP, do everything
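A minimal Python sketch of the three cases above. The names `Processing` and `classify_flight` are hypothetical, `has_gcp` is assumed to come from a lookup made elsewhere, and the flight name is assumed to be the file stem. Checking the suffix before GCP is also an assumption, since the list doesn't say which rule wins for a _B flight at a site without GCP.

```python
import re
from enum import Enum
from pathlib import Path


class Processing(Enum):
    BIRDS_AND_MAPBOX_ONLY = "bird prediction and mapbox creation only"
    ALL_EXCEPT_NESTS = "everything except nest detection"
    ALL = "everything"


# Assumption: repeat flights end with a single capital letter other than A.
REPEAT_SUFFIX = re.compile(r"_[B-Z]$")


def classify_flight(flight_path, has_gcp):
    """Decide how much of the pipeline to run for one flight."""
    stem = Path(flight_path).stem
    if REPEAT_SUFFIX.search(stem):
        # _B, _C, etc.: bird prediction and mapbox creation only
        return Processing.BIRDS_AND_MAPBOX_ONLY
    if not has_gcp:
        # no GCP: skip nest detection
        return Processing.ALL_EXCEPT_NESTS
    # no suffix or _A, with GCP: run everything
    return Processing.ALL
```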
In the process I think I may have also stumbled across a clean solution for the complexities of Snakemake and the nest detection part of the pipeline.
Proposal:
Add a new output to the predict_birds rule: the file "/blue/ewhite/everglades/predictions/{year}/{site}/nest_flights_info.txt".
This file will list the paths of the individual flights for the given year and site, along with timestamps for when predict.py was last run on each of them (building on Henry’s approach).
Add a new function to predict.py that reads in nest_flights_info.txt and updates it for the file being processed, either by updating the timestamp if that file is already listed or by adding the file if it hasn’t been processed before. The timestamp won’t be used; it’s only changed to ensure that the file itself changes (and to provide useful information for humans debugging or tracking builds).
This function only does this for files that meet the requirements for nest detection. So the function checks whether the path ends with just the year or with _A, and it also checks the year and site against a lookup table to determine whether GCPs are present. If either of those conditions is false, the function just returns and never touches nest_flights_info.txt.
The input for the detect_nests rule is then "/blue/ewhite/everglades/predictions/{year}/{site}/nest_flights_info.txt", and we update load_files() to use the files listed in nest_flights_info.txt instead of globbing for .shp files. The output is "/blue/ewhite/everglades/detected_nests/{site}_{year}_detected_nests.shp". If nest_flights_info.txt is empty (as it will be for sites with no GCP), then the output is an empty shapefile.
The input for the process_nests rule is "/blue/ewhite/everglades/detected_nests/{site}_{year}_detected_nests.shp" and the output is "/blue/ewhite/everglades/processed_nests/{site}_{year}_processed_nests.shp". If the input is an empty shapefile then the output is an empty shapefile (this may happen by default but might also need a quick conditional check).
The combine_nests rule then takes
expand("/blue/ewhite/everglades/processed_nests/{site}_{year}_processed_nests.shp",
zip, site=ORTHOMOSAICS.site, year = ORTHOMOSAICS.year)
as the input.
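As a rough sketch of the bookkeeping described in the proposal, assuming a tab-separated "path, timestamp" line format that the proposal doesn't actually specify: `update_nest_flights_info` and `read_nest_flights` are hypothetical names, `has_gcp` stands in for the year/site lookup table, and the flight name is assumed to be the file stem.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Assumption: nest-eligible flight names end with a 4-digit year,
# optionally followed by "_A".
NEST_FLIGHT = re.compile(r"\d{4}(_A)?$")


def update_nest_flights_info(info_file, flight_path, has_gcp):
    """Add or refresh flight_path in info_file; return True if it was written."""
    flight_path = str(flight_path)
    if not has_gcp or not NEST_FLIGHT.search(Path(flight_path).stem):
        return False  # not a nest-detection flight: never touch the file

    info_file = Path(info_file)
    entries = {}
    if info_file.exists():
        for line in info_file.read_text().splitlines():
            if line.strip():
                path, _, stamp = line.partition("\t")
                entries[path] = stamp

    # A fresh timestamp guarantees the file content changes on every run.
    entries[flight_path] = datetime.now(timezone.utc).isoformat()
    info_file.parent.mkdir(parents=True, exist_ok=True)
    info_file.write_text(
        "".join(f"{path}\t{stamp}\n" for path, stamp in sorted(entries.items()))
    )
    return True


def read_nest_flights(info_file):
    """Return the listed flight paths; a missing or empty file gives []."""
    info_file = Path(info_file)
    if not info_file.exists():
        return []
    return [
        line.partition("\t")[0]
        for line in info_file.read_text().splitlines()
        if line.strip()
    ]
```

Something like `read_nest_flights` could stand in for the globbing in load_files(), with an empty list leading to an empty shapefile downstream.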
I think this preserves the full input-output chain and should trigger appropriate rebuilds. It does this by having the output converge to a single file per site-year combination where appropriate (at the detect_nests step). That file (predictions/{year}/{site}/nest_flights_info.txt) should only get updated if one of its inputs changes (i.e., a new flight is added or an existing flight is changed). Therefore the steps further down the pipeline should only be run when they need to be.
That said, I’ve thought I had solutions to all of this in the past and been wrong. So @henrykironde - take a look at this, see if you think it will actually work, and if so decide whether you think this approach has enough benefits in terms of simplicity (i.e., we don’t have to handle tracking file changes ourselves) to go in this direction vs. the more hands-on approach you've been working on (at least as I understood it last).
This didn't work, but it led to #93, which does. The ideas in (4) haven't been implemented yet, so that should be moved to a new issue and done. The key is that every real site-year combination does need a detected_nests file. It should just be empty if a site isn't being used for nest detection.