Add detailed data types #215

Open
wants to merge 7 commits into
base: main
tacaswell
Contributor

Description

This is, modulo a massive and possibly unneeded reorganization and some bug fixes (the first 5 commits are not strictly needed for this work), the implementation of #214.

The consensus I reached in #214 is to use the numpy dtype.str and dtype.descr as two additional keys, which gives us enough information to identify both "built in" types and structured types using a pre-existing scheme. This was picked over the PEP 3118 string formatting due to the wider adoption and better documentation of the numpy scheme. Two keys were chosen over one key of variable type to avoid the type instability. There may be a case that the descr field should be extra optional (we must have 'dtype', we may have a 'dtype_str', and if we have a 'dtype_str' we may also have a 'dtype_descr').

The rules for getting back to the numpy dtype are:

  • if the type is not 'V', then dt = np.dtype(dk['dtype_str'])
  • if the type is 'V', then dt = np.dtype(dk['dtype_descr'])

which is fiddly, but I think acceptable. It may be possible to get more inside the head of np.dtype and pass some function in numpy both the str and the descr and let it sort things out, but I have not found that function yet.
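The two rules above can be sketched as a small helper. This is only an illustration, not code from this PR: the function names `dtype_from_data_key` and `_to_descr` are hypothetical, and the list-to-tuple conversion assumes the descr has been round-tripped through JSON (which turns tuples into lists, while `np.dtype` expects a list of tuples).

```python
import numpy as np


def _to_descr(fields):
    """Turn a JSON-style list-of-lists descr back into a list of tuples.

    Hypothetical helper: assumes each field is [name, type] or
    [name, type, shape], where type may itself be a nested descr.
    """
    out = []
    for field in fields:
        items = []
        for item in field:
            if isinstance(item, (list, tuple)) and item and isinstance(item[0], (list, tuple)):
                items.append(_to_descr(item))  # nested structured dtype
            elif isinstance(item, list):
                items.append(tuple(item))  # sub-array shape, e.g. [2] -> (2,)
            else:
                items.append(item)
        out.append(tuple(items))
    return out


def dtype_from_data_key(dk):
    """Rebuild a numpy dtype from the 'dtype_str' / 'dtype_descr' keys."""
    simple = np.dtype(dk["dtype_str"])
    if simple.kind != "V":
        # Built-in type: the type string alone is enough.
        return simple
    # Structured (void) type: the field list carries the real information.
    return np.dtype(_to_descr(dk["dtype_descr"]))
```

For example, `dtype_from_data_key({"dtype_str": "|V9", "dtype_descr": [["a", "|u1"], ["b", "<f8"]]})` should give back the packed two-field dtype.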

There is more information in the array interface (`__array_interface__`) bundle, like the offsets or padding, that we are not capturing here because that is a hardware-dependent detail and not machine-invariant structure. That is, from the point of view of the event model, [('a', 'u1'), ('b', 'f8')] with the float aligned to the byte boundary or to the 8-byte boundary are "the same". Describing the exact in-memory layout should be left to a library (like tiled!) that handles serialization / communication between processes.
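To make the "same structure, different layout" point concrete, here is a minimal sketch using plain numpy (nothing here is event-model API). It builds the two-field dtype packed and aligned and compares the field names and field types; the aligned item size is platform dependent, so only the packed size is stated.

```python
import numpy as np

fields = [("a", "u1"), ("b", "f8")]

packed = np.dtype(fields)  # 'b' starts right after 'a'
aligned = np.dtype(fields, align=True)  # 'b' is padded out to its alignment

# The logical structure (names and per-field types) is identical...
same_names = packed.names == aligned.names
same_field_types = all(
    packed.fields[name][0] == aligned.fields[name][0] for name in packed.names
)

# ...but the in-memory layout is not: the packed item is 1 + 8 = 9 bytes,
# while the aligned item is larger because of the padding after 'a'.
packed_size = packed.itemsize
aligned_size = aligned.itemsize
```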

Relatedly, given the above discussion, one could argue that we should drop the endianness of the data (as that is the poster child for machine-dependent details!), but I think carrying around a bit of "too detailed" information is an acceptable cost for not having to invent and describe a variation on the numpy scheme that ignores the endianness.
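As a small illustration of where the endianness lives, the sketch below (plain numpy, no event-model code) shows that the byte order is the first character of dtype.str, so keeping the numpy scheme unmodified means carrying it along.

```python
import numpy as np

big = np.dtype(">f8")
little = np.dtype("<f8")

# dtype.str always spells out the byte order for multi-byte types...
big_str = big.str  # '>f8'
little_str = little.str  # '<f8'

# ...and uses '|' where byte order does not apply (single-byte types).
byte_str = np.dtype("u1").str  # '|u1'

# Both describe an 8-byte float; only the byte order differs.
same_kind_and_size = (big.kind, big.itemsize) == (little.kind, little.itemsize)
```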

Motivation and Context

Closes #214

How Has This Been Tested?

  • has some tests that push through descriptors with the additional keys
  • still needs test of showing the schema failing
  • still needs test of deeply nested dtypes

Docs

Need to edit and migrate my ranting in #214 to the docs.

cross project work

  • ophyd needs to start generating improved responses to .describe. This should be backward compatible, as the model did not say data_keys was not allowed to have additional keys, so anything consuming them should already be able to ignore the extra keys (but the risk is not zero)
  • tiled needs to learn how to start looking for and using these additional keys to advertise to clients the additional layer of structure
  • we need to retrofit (either in a migration or with a transformer) the descriptors from SIX
  • make sure that tiled can cope with a data-frame handler (using the above information)

Split all of the code out into (private for now) submodules.
Due to the constraint that keys can not have '.' in them (due to mongo) we
have a (recursive) constraint that user supplied dicts (objects in json lingo)
must not have '.' in their keys.
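A minimal sketch of what such a recursive check could look like in plain Python. The function name `find_dotted_keys` is hypothetical, and the PR itself may well enforce this in the schema rather than in code.

```python
def find_dotted_keys(obj, path=()):
    """Return the path to every dict key containing '.', at any depth."""
    bad = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            here = path + (key,)
            if "." in str(key):
                bad.append(here)
            # Keep descending: nested dicts are subject to the same rule.
            bad.extend(find_dotted_keys(value, here))
    elif isinstance(obj, (list, tuple)):
        for i, item in enumerate(obj):
            bad.extend(find_dotted_keys(item, path + (i,)))
    return bad
```

An empty result means the user-supplied dict is safe to store; a non-empty result lists the offending keys by path.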
The dtype of a [512, 512] array should be 'array' not 'number'
Increment the event count by the number of new events, not the number of keys
in the events!
This will extract the shape, (json)dtype, and the detailed numpy dtype
information for a value.
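The extraction described in these commit messages might look roughly like the sketch below. It is an assumption-laden illustration, not the code in this PR: the name `describe_value` and the kind-to-JSON-type table are hypothetical, and it follows the rule stated above that anything with a non-empty shape is an 'array'.

```python
import numpy as np

# Hypothetical mapping from numpy kind codes to JSON-style dtype names.
_KIND_TO_JSON = {
    "b": "boolean",
    "i": "integer",
    "u": "integer",
    "f": "number",
    "U": "string",
    "S": "string",
}


def describe_value(value):
    """Extract shape, JSON dtype, and detailed numpy dtype info for a value."""
    arr = np.asarray(value)
    if arr.shape:
        # Anything with a non-empty shape is an 'array', whatever it holds.
        json_dtype = "array"
    else:
        # Falling back to 'string' for unknown kinds is an assumption.
        json_dtype = _KIND_TO_JSON.get(arr.dtype.kind, "string")
    return {
        "shape": list(arr.shape),
        "dtype": json_dtype,
        "dtype_str": arr.dtype.str,
        "dtype_descr": arr.dtype.descr,
    }
```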
Successfully merging this pull request may close these issues.

Adding support for detailed and structured data types