Add detailed data types #215

tacaswell · 2021-08-30T22:28:17Z

Description

This is the implementation of #214, modulo a massive and possibly unneeded reorganization and some bug fixes (the first 5 commits are not strictly needed for this work).

The consensus I reached in #214 is to use the numpy dtype.str and dtype.descr as 2 additional keys, which gives us enough information to identify both "built in" types and structured types using a pre-existing scheme. This was picked over the PEP 3118 format strings because the numpy scheme is more widely adopted and better documented than the PEP scheme. 2 keys were chosen over 1 key of variable type to avoid type instability. There may be a case that the descr field should be extra optional (we must have 'dtype', we may have a 'dtype_str', and if we have a 'dtype_str' we may also have a 'dtype_descr').

The rules for getting back to the numpy dtype are:

if the type is not 'V', then dt = np.dtype(dk['dtype_str'])
if the type is 'V', then dt = np.dtype(dk['dtype_descr'])

which is fiddly, but I think acceptable. It may be possible to get more inside the head of np.dtype and pass some function in numpy both the str and the descr and let it sort things out, but I have not found that function yet.
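The two rules above can be sketched as a small helper. This is only an illustration, not code from this PR: the function names dtype_from_data_key and descr_to_tuples are hypothetical, and the tuple conversion assumes the descr has been through JSON (which turns tuples into lists that np.dtype will not accept directly).

```python
import json

import numpy as np


def descr_to_tuples(descr):
    """Turn a JSON-decoded descr (lists of lists) back into a list of tuples."""
    fields = []
    for name, typ, *rest in descr:
        if not isinstance(typ, str):
            # A nested structured field carries its own descr list.
            typ = descr_to_tuples(typ)
        # An optional third entry is a sub-array shape.
        fields.append((name, typ, *(tuple(shape) for shape in rest)))
    return fields


def dtype_from_data_key(dk):
    """Apply the two rules above to one data key dict (hypothetical helper)."""
    dt = np.dtype(dk["dtype_str"])
    if dt.kind != "V":
        return dt
    return np.dtype(descr_to_tuples(dk["dtype_descr"]))


original = np.dtype([("a", "u1"), ("b", "f8")])
data_key = {
    "dtype": "array",
    "dtype_str": original.str,
    "dtype_descr": original.descr,
}
# Simulate the document being stored and read back as JSON.
recovered = dtype_from_data_key(json.loads(json.dumps(data_key)))
print(recovered == original)
```

Checking the kind of the parsed dtype_str, rather than inspecting the string by hand, keeps the branch in one place; whether that is the right long-term home for the rule is a separate question.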

There is more information in the array interface (__array_interface__) bundle, like the offsets or padding, that we are not capturing here because that is a hardware-dependent detail and not machine-invariant structure. That is, from the point of view of the event model, [('a', 'u1'), ('b', 'f8')] with the float aligned to the byte boundary or to the 8-byte boundary are "the same". Describing the exact in-memory layout should be left to a library (like tiled!) that handles serialization / communication between processes.
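To make the layout point concrete, here is a small sketch comparing a packed and an aligned version of the example dtype. It only uses the numpy align= option to produce the two layouts; the exact aligned size depends on the platform, so no specific value is assumed for it.

```python
import numpy as np

fields = [("a", "u1"), ("b", "f8")]
packed = np.dtype(fields)               # no padding between the fields
aligned = np.dtype(fields, align=True)  # padded the way a C compiler would

# Same field names and same per-field types ...
print(packed.names == aligned.names)
print(packed.fields["b"][0] == aligned.fields["b"][0])

# ... but a different in-memory layout (itemsize and offset of 'b').
print(packed.itemsize, aligned.itemsize)
print(packed.fields["b"][1], aligned.fields["b"][1])
```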

Relatedly, given the above discussion, one could argue that we should drop the endianness of the data (as that is the poster child for machine-dependent details!), but I think carrying around a bit of "too detailed" information is an acceptable cost of not having to invent and describe a variation on the numpy scheme that ignores endianness.
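As a quick illustration of what "carrying the endianness" means in practice, the byte-order character is simply the first character of dtype.str, so two otherwise identical types compare as different:

```python
import numpy as np

little = np.dtype("<f8")
big = np.dtype(">f8")

print(little.str, big.str)  # the leading character records the byte order
print(little == big)        # numpy treats them as distinct dtypes
print(little.kind == big.kind, little.itemsize == big.itemsize)
```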

Motivation and Context

Closes #214

How Has This Been Tested?

has some tests that push through descriptors with the additional keys
still needs a test showing the schema failing
still needs a test of deeply nested dtypes

Docs

Need to edit and migrate my ranting in #214 to the docs.

cross project work

ophyd needs to start generating improved responses to .describe. This should be back-compatible, as the model did not say data_keys was not allowed to have additional keys, so anything consuming them should already be able to ignore the extra keys (but the risk is not zero)
tiled needs to learn how to start looking for and using these additional keys to advertise to clients the additional layer of structure
we need to retro-fit (either in a migration or with a transformer) the descriptors from SIX
make sure that tiled can cope with a data-frame handler (using the above information)

TST: add py39 and py310 testing

99ae5b9

tacaswell mentioned this pull request Sep 1, 2021

ENH: add structured data support bluesky/databroker#670

Merged

tacaswell added 6 commits September 10, 2021 10:45

MNT: major refactor

d5bf3c5

Split all of the code out into (private for now) submodules.

MNT: rename internal type definition

37d097b

Due to the constraint that keys can not have '.' in them (due to mongo) we have a (recursive) constraint that user supplied dicts (objects in json lingo) must not have '.' in their keys.
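A recursive check of this kind could look like the sketch below. The function name find_dotted_keys is hypothetical, and descending into lists is an assumption made here; the schema in this PR may express the constraint differently.

```python
def find_dotted_keys(obj, path=()):
    """Return the path to every dict key containing '.', searching recursively."""
    bad = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            here = path + (key,)
            if "." in str(key):
                bad.append(here)
            bad.extend(find_dotted_keys(value, here))
    elif isinstance(obj, (list, tuple)):
        # Assumption: dicts nested inside lists are checked as well.
        for index, item in enumerate(obj):
            bad.extend(find_dotted_keys(item, path + (index,)))
    return bad


user_md = {"sample": {"name.long": "Cu"}, "tags": [{"a.b": 1}], "ok": 2}
print(find_dotted_keys(user_md))
```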

TST: fix data keys info in test_em.py

907a37a

The dtype of a [512, 512] array should be 'array' not 'number'

FIX: increment the event count by correct amount

686faa8

Increment the event count by the number of new events, not the number of keys in the events!
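The commit itself is not shown here, so the following is only an assumed illustration of how such an off-by-a-lot count can arise with a column-oriented event page: calling len() on the page dict counts its keys, while the number of events is the length of one of its columns. The page contents are made up.

```python
# A made-up event page holding three events.
event_page = {
    "descriptor": "hypothetical-descriptor-uid",
    "seq_num": [1, 2, 3],
    "time": [0.0, 1.0, 2.0],
    "uid": ["e1", "e2", "e3"],
    "data": {"det": [10, 11, 12]},
    "timestamps": {"det": [0.0, 1.0, 2.0]},
}

keys_in_page = len(event_page)               # counts the dict keys
events_in_page = len(event_page["seq_num"])  # counts the events

print(keys_in_page, events_in_page)
```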

ENH: add infer_datakeys helper

c912871

This will extract the shape, (json)dtype, and the detailed numpy dtype information for a value.
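A rough sketch of what extracting this information from a value can look like is given below. It is not the infer_datakeys implementation from this PR: the function name sketch_data_key_info and the kind-to-JSON-type mapping are assumptions made for illustration.

```python
import numpy as np


def sketch_data_key_info(value):
    """Collect shape, JSON-level dtype, and numpy dtype details for a value."""
    arr = np.asarray(value)
    if arr.ndim > 0:
        json_dtype = "array"  # anything with a shape is an array, not a number
    elif arr.dtype.kind == "b":
        json_dtype = "boolean"
    elif arr.dtype.kind in "iuf":
        json_dtype = "number"
    elif arr.dtype.kind in "US":
        json_dtype = "string"
    else:
        raise TypeError(f"no JSON type chosen for dtype {arr.dtype!r}")
    return {
        "dtype": json_dtype,
        "shape": list(arr.shape),
        "dtype_str": arr.dtype.str,
        "dtype_descr": arr.dtype.descr,
    }


image_info = sketch_data_key_info(np.zeros((512, 512)))
scalar_info = sketch_data_key_info(1.5)
print(image_info["dtype"], image_info["shape"])
print(scalar_info["dtype"], scalar_info["shape"])
```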

ENH: add dtype_str and dtype_descr to descriptor schema

956bf30

tacaswell force-pushed the enh_add_detailed_dtypes branch from 8a65a86 to 956bf30 Compare September 21, 2021 16:56

tacaswell marked this pull request as ready for review October 29, 2021 15:03

mrakitin self-requested a review May 10, 2022 16:03

danielballan mentioned this pull request Dec 15, 2023

Adding support for detailed and structured data types #214

Open

tacaswell mentioned this pull request May 29, 2024

signal: init value can be a ValueInfo structure with dtype, shape and… bluesky/ophyd#1194

Draft

danielballan mentioned this pull request Sep 20, 2024

Add dtype_numpy with medium-effort validation #315

Open
