Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow precise date ranges #1304

Open
corneliusroemer opened this issue Sep 3, 2023 · 6 comments · May be fixed by #1305
Open

Allow precise date ranges #1304

corneliusroemer opened this issue Sep 3, 2023 · 6 comments · May be fixed by #1305
Labels
enhancement New feature or request

Comments

@corneliusroemer
Copy link
Member

corneliusroemer commented Sep 3, 2023

Context

Currently, date ranges can only be defined by ambiguous/incomplete dates in the format YYYY-MM-DD with some characters swapped in with the wildcard character X¹. This only allows date ranges to be bounded by start/end of a year or month and not a more precise range within or across year/month boundaries.

¹ this could be better documented: #882

Scope

Anywhere that currently accepts ambiguous/incomplete dates (i.e. uses AmbiguousDate):

  • augur filter
  • augur frequencies
  • augur refine

Potential solutions

  1. ⛔️ Allow scalar boundary values under new columns e.g. --min-date-column / --max-date-column
  2. ✅ Allow a new range format for values under the existing date column¹
    1. ⛔️ Allow the TreeTime format of [min:max] where min and max are numeric dates
    2. ✅ Allow the ISO format of min/max where min and max are ISO dates
    3. ⛔️ Merge (1) and (2) allowing a format of [min/max] where min and max are any scalar date accepted by Augur (ISO or numeric)

¹ there is a feature request to make the date column name customizable: #1443


original issue description

Context

To ensure patient privacy, Denmark bins SARS-CoV-2 collection dates to the Monday of the week the actual collection date lies in.

Currently, we seem to be unable to specify such ambiguity within augur refine even though treetime supports arbitrarily constrained date ranges.

The only workaround right now I can think of is to make the date ambiguous, but that throws away information while also not working in situations where the week crosses a month boundary.

The issue has become particularly noticeable when making BA.2.86 trees where Denmark has provided ~30% of global sequences so far. To remove bias, I could add 3/4 days to the dates, but it would be nice if refine just accepted ranges as such.

Description

Make refine support arbitrary min/max input dates.

Possible solution

The simplest way to implement this would be to accept the min/max date format that's natively supported by treetime: [min:max] as in [2022.3452:2022.3649] (opening bracket, min date as year float, colon, max date as year float, closing bracket)

A neater solution might be to allow min and max dates to be specified through two columns: --min-date and --max-date.

This change could potentially be made across all date handling: it would be more general than our current x'ing strategy, e.g. 2021-10-XX for ambiguous date of month. That'd be a lot of work though.

@corneliusroemer corneliusroemer added the enhancement New feature or request label Sep 3, 2023
@corneliusroemer corneliusroemer changed the title ENH: Allow generalized ambiguous dates with arbitrary min/max not confinded arbitrarily to year/months ENH(refine): Allow generalized ambiguous dates with arbitrary min/max not confinded arbitrarily to year/months Sep 3, 2023
corneliusroemer added a commit that referenced this issue Sep 4, 2023
Add support for the treetime ambiguous date format `[float_min:float_max]`, e.g. `[2004.43:2005.58]`
This is useful for when ambiguity is not limited
to month/year boundaries.
Resolves #1304
@corneliusroemer corneliusroemer linked a pull request Sep 4, 2023 that will close this issue
3 tasks
corneliusroemer added a commit that referenced this issue Sep 6, 2023
Add support for the treetime ambiguous date format `[float_min:float_max]`, e.g. `[2004.43:2005.58]`
This is useful for when ambiguity is not limited
to month/year boundaries.
Resolves #1304

(cherry picked from commit df94593)
@victorlin victorlin changed the title ENH(refine): Allow generalized ambiguous dates with arbitrary min/max not confinded arbitrarily to year/months Allow precise date ranges Jun 27, 2024
@victorlin
Copy link
Member

victorlin commented Jun 27, 2024

This was discussed recently on Slack with some different ideas. I'll plan to work on this soon so I've updated the issue description with latest ideas.

Commenting on some things from the original issue description:

The simplest way to implement this would be to accept the min/max date format that's natively supported by treetime: [min:max] as in [2022.3452:2022.3649]

On the Slack discussion the existing ISO standard format <start>/<end> was brought up as a 2nd option. But I do think the brackets are nice. I've proposed a 3rd option. While it doesn't follow any standards, it seemed like the best of both worlds: the brackets convey an inclusive range notation, and : could be problematic down the road if supporting ISO dates with a time part. Both added to the new issue description.

A neater solution might be to allow min and max dates to be specified through two columns: --min-date and --max-date.

I'll shoot this down as this would be confusing if/when we're to bring this to augur filter which already has --min-date/--max-date. I think it makes more sense to augment the existing handling of ambiguous dates in the date column with added support for precise date ranges.

This change could potentially be made across all date handling: it would be more general than our current x'ing strategy, e.g. 2021-10-XX for ambiguous date of month. That'd be a lot of work though.

Making this change across all date handling is the best/easiest way to do it since date handling is centralized under augur/dates. Doing it only in one subcommand would require extra work and cause more confusion to users.

@victorlin victorlin self-assigned this Jun 27, 2024
@trvrb
Copy link
Member

trvrb commented Jun 27, 2024

Cool! I agree to ditch --min-date and --max-date columns. I'm happy with various options for ranges:

  • colon: 2024-01-01:2024-06-01
  • colon/bracket: [2024-01-01:2024-06-01]
  • slash: 2024-01-01/2024-06-01
  • slash/bracket: [2024-01-01/2024-06-01]

Though I think I most prefer either Pythonic colon/bracket eg [2024-01-01:2024-06-01], or ISO slash eg 2024-01-01/2024-06-01 rather than mishmash. I don't see an issue with : and including time part of date, just because molecular clock dating, sample collection, etc... should never demand this level of precision.

@jameshadfield
Copy link
Member

jameshadfield commented Jun 27, 2024

I vote to stay consistent with the ISO format we're already following, I don't see any compelling reasons to deviate from it. It also allows more concise forms like "2008-02-15/03-14", where "/03-14" implies "/2008-03-14".

@victorlin
Copy link
Member

I considered blocking this on #936 but #1305 shows that this functionality can be implemented within get_numerical_date_from_value() which should be sufficient.

@tsibley
Copy link
Member

tsibley commented Jul 8, 2024

Consistent with ISO gets my vote.

In past work for Seattle Flu Study, we used range types in various places. That used Postgres range syntax, which includes support for inclusive ([ and ]) vs. exclusive (( and )) bounds and omitted bounds (e.g. [2008-02-15, )). Would those features be useful/desired here?

@victorlin
Copy link
Member

Taking my name off as I haven't gotten to this and it doesn't seem high priority at the moment.

@victorlin victorlin removed their assignment Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

5 participants