
Maintain raw precision instead of down casting to float32 #395

Open · wants to merge 2 commits into master

Conversation


@juntyr juntyr commented Jul 22, 2024

This is a fix by @dmey to avoid losing precision when using cfgrib. David developed it while working on the field compression project, which I have since taken up.


FussyDuck commented Jul 22, 2024

CLA assistant check
All committers have signed the CLA.

Author

juntyr commented Jul 22, 2024

@dmey I hope you’re ok with me wanting to upstream your fix, as it’s greatly benefited me over the past year. Btw, do you need to sign the CLA as well?


dmey commented Jul 22, 2024

Hi @juntyr, sure no problem--give it a go and let me know if I need to.

Author

juntyr commented Jul 22, 2024

> Hi @juntyr, sure no problem--give it a go and let me know if I need to.

I signed it but the action still shows it as unsigned, so perhaps you need to sign it too?


dmey commented Jul 22, 2024

It should be fine now.

@iainrussell
Member

Hi @juntyr and @dmey, thanks for this change! However, I think this is something we would prefer to be an option since it will double the memory requirement (and we do have GRIB fields with over 1 billion values).

Potential API:

ds = xr.open_dataset('data.grib', engine='cfgrib', backend_kwargs={'values_dtype': 'float64'})

Do you think you could make this modification to the PR, or would you rather leave it as a request issue?
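The memory concern above can be illustrated with a quick NumPy itemsize calculation (the 1-billion-value field size is taken from the comment; the rest is plain arithmetic):

```python
import numpy as np

n_values = 1_000_000_000  # GRIB fields with over 1 billion values, per the comment above

gib = 1024 ** 3
f32 = n_values * np.dtype(np.float32).itemsize / gib  # 4 bytes per value
f64 = n_values * np.dtype(np.float64).itemsize / gib  # 8 bytes per value
print(f"float32: {f32:.2f} GiB, float64: {f64:.2f} GiB")
# float32: 3.73 GiB, float64: 7.45 GiB
```

Defaulting to float64 exactly doubles the in-memory footprint, which is why an opt-in `values_dtype` keyword is preferable to changing the default.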

Author

juntyr commented Jul 30, 2024

> Hi @juntyr and @dmey, thanks for this change! However, I think this is something we would prefer to be an option since it will double the memory requirement (and we do have GRIB fields with over 1 billion values).

Good point! What about the new version, where the dtype is inferred from the GRIB messages, so that cfgrib just exposes whatever dtype the GRIB file uses?

@iainrussell
Member

Ah, nice idea, but it does not really work. GRIB files can encode their values in a number of ways, including pretty much any number of bits per value (e.g. 3, 11, 18, 24), and these cannot be expressed as numpy arrays.

When we get the array of values from the GRIB file via the ecCodes library, we call an ecCodes function that decodes the file and always returns an array of float64, regardless of what the encoding in the original GRIB message was; then we convert to float32. There is now also a function that returns the values as a float32 array, which is in fact what cfgrib should be calling in the first place!

So, since there usually is no numpy way of expressing the encoding in the GRIB file, the only choice we have is to ask ecCodes for float32 or float64 (it does the dirty work of the conversion); we can then always convert to another numpy representation if needed. (The current behaviour is inefficient because we ask ecCodes for a float64 array and then cast it to float32.)

Author

juntyr commented Jul 30, 2024

> Ah, nice idea, but it does not really work. GRIB files can encode their values in a number of ways including pretty much any number of bits per value (e.g. 3, 11, 18, 24) and these cannot be expressed as numpy arrays. […]

Would there be a way to ask GRIB to return the value in the closest matching numpy dtype (for now just f32 and f64) or to have this functionality inside cfgrib to query the number of bits and choose the closest dtype to request from eccodes based on that?
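As context for this question, the heuristic being asked about might look something like the sketch below, picking the narrowest IEEE float whose significand can hold the packed integers. The function name is hypothetical (not cfgrib or ecCodes API), and the reply that follows explains why this is unreliable in general: `bitsPerValue` alone ignores the reference value, scale factors, and more complex packing types.

```python
import numpy as np

def closest_float_dtype(bits_per_value: int) -> np.dtype:
    """Hypothetical heuristic: map a GRIB bitsPerValue to a numpy float dtype.

    float32 has a 24-bit significand (23 stored bits plus one implicit bit),
    so packed integers of up to 24 bits fit in it without rounding. This
    deliberately ignores the reference value and scale factors applied
    during decoding, which can still cost precision.
    """
    if bits_per_value <= 24:
        return np.dtype(np.float32)
    return np.dtype(np.float64)
```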

@iainrussell
Member

Not so easily, as GRIB files can also encode with things such as JPEG and PNG compression, and there are other more complex packing types. In any case, I really do prefer that the user is in control of the dtype of the resulting array so that they can control the precision and memory usage of the resulting xarray.

Author

juntyr commented Jul 30, 2024

> Not so easily, as GRIB files can also encode with things such as JPEG and PNG compression, and there are other more complex packing types. In any case, I really do prefer that the user is in control of the dtype of the resulting array so that they can control the precision and memory usage of the resulting xarray.

Thanks for your explanation! In that case, I agree that your proposed API would work best, and I can implement it.

In my use case, evaluating different compression strategies, I do want to know what the native precision of the GRIB file is. Is there a better way than loading the data in float64 precision, then casting to float32 and checking whether any precision was lost?
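For reference, the round-trip check described above is straightforward in plain NumPy (the sample values here are made up for illustration):

```python
import numpy as np

def survives_float32(a: np.ndarray) -> bool:
    """True if a float64 array round-trips through float32 without loss."""
    return np.array_equal(a, a.astype(np.float32).astype(np.float64))

exact = np.array([0.5, 1.25, 2048.0])       # exactly representable in float32
inexact = np.array([2147.02, 1.0000001])    # requires more than 24 significand bits

print(survives_float32(exact), survives_float32(inexact))  # True False
```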

@iainrussell
Member

Well, you must have ecCodes installed on your system if you have cfgrib there, so this command-line call can help, if you're feeling strong enough to handle the gory details:

grib_dump -O gfs.t00z.pgrb2.0p25.f000

Look at Section 5 of the output and you will see something like this:

======================   SECTION_5 ( length=49, padding=0 )    ======================
1-4       section5Length = 49
5         numberOfSection = 5
6-9       numberOfValues = 1038240
10-11     dataRepresentationTemplateNumber = 3 [Grid point data - complex packing and spatial differencing (grib2/tables/2/5.0.table) ]
12-15     referenceValue = 2147.02
16-17     binaryScaleFactor = 0
18-19     decimalScaleFactor = 1
20        bitsPerValue = 9
21        typeOfOriginalFieldValues = 0 [Floating point (grib2/tables/2/5.1.table) ]
22        groupSplittingMethodUsed = 1 [General group splitting (grib2/tables/2/5.4.table) ]
23        missingValueManagementUsed = 0 [No explicit missing values included within data values (grib2/tables/2/5.5.table) ]
24-27     primaryMissingValueSubstitute = 1649987994
28-31     secondaryMissingValueSubstitute = 4294967295
32-35     numberOfGroupsOfDataValues = 32875
36        referenceForGroupWidths = 0
37        numberOfBitsUsedForTheGroupWidths = 4
38-41     referenceForGroupLengths = 1
42        lengthIncrementForTheGroupLengths = 1
43-46     trueLengthOfLastGroup = 41
47        numberOfBitsForScaledGroupLengths = 7
48        orderOfSpatialDifferencing = 2 [Second-order spatial differencing (grib2/tables/2/5.6.table) ]
49        numberOfOctetsExtraDescriptors = 2

This is a file downloaded from where you pointed me to. You can see that it uses 9 bits per value, but on top of that it has complex packing with spatial differencing applied. You're probably sorry you asked now :)
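As a worked illustration of the Section 5 numbers above, the GRIB2 packing relation decodes a packed integer X as Y = (R + X·2^E) · 10^(−D). The sketch below applies it to the dumped values; note this ignores the group splitting and spatial differencing the message actually uses, so it only shows the quantization step and nominal value range implied by the parameters:

```python
# Section 5 values from the grib_dump output above
R = 2147.02   # referenceValue
E = 0         # binaryScaleFactor
D = 1         # decimalScaleFactor
bits = 9      # bitsPerValue

step = 2**E / 10**D               # smallest representable increment
x_max = 2**bits - 1               # largest 9-bit packed integer: 511
y_min = R * 10**-D                # decoded value for X = 0
y_max = (R + x_max * 2**E) * 10**-D

print(f"{step} {y_min:.3f} {y_max:.3f}")  # 0.1 214.702 265.802
```

So despite being stored in only 9 bits, each value decodes to a float on a 0.1-wide grid between roughly 214.7 and 265.8, which is why neither float32 nor float64 maps directly onto the file's native encoding.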
