Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need field for specifying session start time accuracy #321

Closed
rly opened this issue Oct 31, 2019 · 19 comments · Fixed by hdmf-dev/hdmf#874
Closed

Need field for specifying session start time accuracy #321

rly opened this issue Oct 31, 2019 · 19 comments · Fixed by hdmf-dev/hdmf#874
Assignees

Comments

@rly
Copy link
Contributor

rly commented Oct 31, 2019

The NWBFile dataset session_start_time is designed to represent the real-world start time of the session. Based on discussions with @oruebel, I see the value in this field is that you can tell what time of day a dataset was collected, if it occurred at the same time as an earthquake, etc.

But, most systems are not going to store times totally accurately with respect to the world clock at atomic resolution. How do we represent the inaccuracy of the session start time? Most datasets supply precision down to the second. Some datasets, for one reason or another, only know what day that the dataset was collected and thus enter 00:00:00 for the time, e.g. for this DataJoint to NWB converter:
https://github.com/vathes/DJ-NWB-Li-Daie-2015-2016/blob/ae69335a23e6a3c0c3809a54de5480baf8848712/pipeline/export/datajoint_to_nwb.py#L23

@oruebel and I think we should add a float attribute on session_start_time called accuracy that represents the +/- amount of time to which the session_start_time is estimated to be inaccurate. e.g., if the user thinks the value is accurate down to the second, then they should enter 0.5 for the accuracy. If they only know the day, then they should enter noon of that day as the start time and 43200 (12x60x60) for the accuracy. It would be optional, the default value would be np.nan, and we should encourage users to specify this value in tutorials and our best practices document.

This is a minor issue, but we should address it, or else, the session_start_time field will not have a consistent usage and meaning across files.

@rly
Copy link
Contributor Author

rly commented Oct 31, 2019

@ajtritt @bendichter what do you think about this idea?

@t-b
Copy link
Contributor

t-b commented Oct 31, 2019

session_start_time is supposed to be an ISO8601 string since NeurodataWithoutBorders/pynwb#641 got merged.

I'm using something like

DATASET "/session_start_time" {
   DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
   DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   DATA {
      "2019-09-23T19:10:03.435Z"
   }
}
}

for NWBv1 currently. And I will use the same for NWBv2. The accuracy could be just implemented by only writing out the digits you trust or?

@rly
Copy link
Contributor Author

rly commented Oct 31, 2019

That would work if accuracy aligned with the segments of an ISO8601 string, but not if you know that the recording happened around 9:30am +/- 10 minutes because your acquisition computer is not connected to the internet and the clock is off by about that much.

@t-b
Copy link
Contributor

t-b commented Oct 31, 2019

@rly ISO8601 supports durations so you could say that the starting time was 9:20 to 9:40.

But I'm actually not sure that this is a viable route as most date parsing tools only support dates and times and not durations. E.g. https://dateutil.readthedocs.io/en/stable/parser.html does not seem to support durations.

So why not have a accuracy attribute. We could also call it resolution as we have that already. And having it as fractional seconds (or NaN) sounds good as well.

@rly
Copy link
Contributor Author

rly commented Oct 31, 2019

@t-b Ah, I see what you mean. Sorry for misreading.

Pandas supports ISO8601 durations (link), and we already import pandas, so we could use that to parse the durations. Representing an inaccuracy range of 10 minutes as either "P0DT0H10M0S" or 600.0 would be fine with me.

Although we do have resolution already, I think that does not mean the same thing as accuracy here. Perhaps inaccuracy_range is a better term.

@rly
Copy link
Contributor Author

rly commented Nov 20, 2019

Just to be explicit, the proposal on the table is now:

  • Support a new data type in the schema, perhaps called isoduration (we already have isodatetime). The value would be an ISO 8601 extended formatted string for durations, e.g., "P0DT0H10M0S"
  • In PyNWB, this would be parsed using pandas or the isodate Python module.
  • The dataset session_start_time of NWBFile would change to:
- name: session_start_time
  dtype: isodatetime
  doc: ...
  attributes:
  - name: uncertainty
    dtype: isoduration
    doc: The uncertainty of the session start time. The actual session start time may be within 
      plus or minus the uncertainty value from the reported session start time. For example, if 
      it is known only that the session start time was sometime on July 15, 2019 in UTC-7, the 
      session start time should be July 15, 2019 at noon in UTC-7 ("2019-07-15T12:00:00-07:00"),
      and the uncertainty should be 12 hours ("P0DT12H0M0S").
    required: false

@rly rly self-assigned this Nov 20, 2019
@oruebel
Copy link
Contributor

oruebel commented Nov 20, 2019

CC @ln-vidrio

@bendichter
Copy link
Contributor

I feel like the natural solution to the session start time accuracy issue is to let users just supply a date or a time without ms. I think the accuracy thing is overkill and will rarely be used correctly.

@t-b
Copy link
Contributor

t-b commented Nov 20, 2019

I would prefer to use isodatetime module for parsing instead of pandas as pandas supports also non-ISO8601 durations.

@rly
Copy link
Contributor Author

rly commented Nov 20, 2019

@bendichter Yeah, I also think this would rarely be used correctly. 99.9% of cases will be either:

  1. the user supplies the time down to the second based on the computer's clock, and they don't know how accurate that is relative to the world clock.
  2. the user supplies the time down to the day because they don't remember what time they ran the experiment and have no record of it.

In both cases, I think most users would not bother to enter an uncertainty value, and if they don't know the time or the second, it is most likely not important to the study.

Also, acquisition systems do not know the accuracy of their clocks, so any files they create will have unknown uncertainty.

I like @bendichter and @t-b 's suggestion of just supporting datetimes with reduced accuracy. ISO 8601 allows it - "For reduced accuracy, any number of values may be dropped from any of the date and time representations". You can't encode uncertainty values of +/- 3 hours, but it will cover 99.9% of use cases. Unfortunately, all datetime parsers confound missing values with zero values. We could write our own modified datetime reader/writer to differentiate those cases, so the uncertainty is documented in the file (non read/write uses of the datetimes would use 0s for missing values), but that could get messy. Thoughts?

Thanks for all the feedback! I consider this a very minor issue. I say: if there is a need from users, then let's move forward with one option or the other (I lean toward creating a modified datetime string reader/writer). In the meantime, I'll close the issue.

@rly rly closed this as completed Nov 20, 2019
@bendichter bendichter reopened this Dec 11, 2019
@nandchandravadia
Copy link

nandchandravadia commented Dec 11, 2019

Hi,

Is it possible to make session_start_time optional? Since we plan on releasing our data publicly, one issue we currently have is with regards to protecting PHI (patient health information) by not putting the actual date of the experiment. Instead, we have been putting an arbitrary date (e.g., 1-1-1900), which is obviously incorrect and can be misleading. Another possibility would be to allow the user to specify the desired resolution of the session_start_time. In our case, we would just input the year and month.

@bendichter
Copy link
Contributor

@rly I don't really see this as a minor issue, as it comes us for us all the time in different ways:

  • A number of dandisets use midnight clearly as a place-holder. It would be better to just have the date, indicating that time time is unknown.
  • It is common that we can pull out date but not time from metadata, which requires us to fill in data which is often just a midnight placeholder.
  • It is common that we can pull out datetime, but not timezone, which is filled as the local timezone, which may or may not be correct.
  • If you don't really know the time, and you are just filling in midnight then the timezone is completely pointless and just extra work for no reason. On top of that, the syntax for timezone is not straightforward.

In the end, these requirements force us to enter data which is often wrong. It would be so much better and simpler to allow:

  • datetime with timezone
  • datetime without timezone
  • date

If using date, the timestamp_reference_time can still use midnight of that day.

@rly
Copy link
Contributor Author

rly commented Jul 1, 2023

This sounds good to me. @oruebel any thoughts?

@bendichter
Copy link
Contributor

@lawrence-mbf What do you think? Do you think it would be possible to relax this on the MatNWB side as well?

@oruebel
Copy link
Contributor

oruebel commented Jul 4, 2023

The option to allow datetime with timezone and date sound reasonable. I'm not sure what the option to allow datetime without timezone adds.

@CodyCBakerPhD
Copy link
Contributor

I'm not sure what the option to allow datetime without timezone adds.

I get the logic of the principle of how it could be seen as required (experiment had to take place somewhere; if institution is in the NWB file you could even infer what timezone it is in from that)

I think what we take more issue with is that requiring it means the APIs like PyNWB (IDK what MatNWB does) then auto-attach it to the datetime.datetime object if it is missing, and it sets it to the timezome on the local device creating the NWB file, which from experience is not always the same as the device the experiment was performed on, nor is it always performed by the same lab, or necessarily in the same timezone

The syntax for attaching a proper ZoneInfo object to a datetime.datetime is also not entirely trivial or super commonly well known (though we use it liberally throughout the NeuroConv galleries in an attempt to educate) - though helper functions could of course be made for this

So overall it might just be better to loosen it to be an optional detail to include along with the other loosened resolutions of the session_start_time, so that fewer potentially false assumptions are made at time of file creation. Either that or, if we want to keep it required, stop assuming that it's the 'local timezone' according to Python.

@oruebel
Copy link
Contributor

oruebel commented Jul 5, 2023

I think what we take more issue with is that requiring it means the APIs like PyNWB (IDK what MatNWB does) then auto-attach it to the datetime.datetime object if it is missing

Thanks for the clarification @CodyCBakerPhD, that makes sense. It sounds like independent of this schema issue, there is an issue in PyNWB in that we: a) zone information should not be automatically added and b) specifying date information should be simpler. For a) we'd probably need to add/update logic in the NWBFile class in PyNWB. For b) we could look at the ZoneInfo module that was added in Python 3.9 to see if that helps simplify the code.

If we decide to allow "datetime without timezone" then I would still suggest that we look at enhancing PyNWB (NeuroConv, GUIDE etc.) to make it easier and encourage the use of proper timezone information.

@lawrence-mbf
Copy link

This should be doable on MatNWB side. datetime can be represented as all three forms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants