
Maintain raw precision instead of down casting to float32 #395

Open · wants to merge 2 commits into master

Conversation


@juntyr juntyr commented Jul 22, 2024

This is a fix by @dmey to avoid losing precision when using cfgrib. David developed it while working on the field compression project, which I have since taken up.


FussyDuck commented Jul 22, 2024

CLA assistant check
All committers have signed the CLA.

Author

juntyr commented Jul 22, 2024

@dmey I hope you’re ok with me wanting to upstream your fix, as it’s greatly benefited me over the past year. Btw, do you need to sign the CLA as well?


dmey commented Jul 22, 2024

Hi @juntyr, sure no problem--give it a go and let me know if I need to.

Author

juntyr commented Jul 22, 2024

> Hi @juntyr, sure no problem--give it a go and let me know if I need to.

I signed it but the action still shows it as unsigned, so perhaps you need to sign it too?


dmey commented Jul 22, 2024

It should be fine now.

@iainrussell
Member

Hi @juntyr and @dmey, thanks for this change! However, I think this is something we would prefer to be an option since it will double the memory requirement (and we do have GRIB fields with over 1 billion values).

Potential API:

ds = xr.open_dataset('data.grib', engine='cfgrib', backend_kwargs={'values_dtype': 'float64'})

Do you think you could make this modification to the PR, or would you rather leave it as a request issue?
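The memory concern above can be illustrated with a quick NumPy itemsize calculation (the 1-billion-value field size is taken from the comment; the rest is plain arithmetic):

```python
import numpy as np

n_values = 1_000_000_000  # GRIB fields with over 1 billion values, per the comment above

gib = 1024 ** 3
f32 = n_values * np.dtype(np.float32).itemsize / gib  # 4 bytes per value
f64 = n_values * np.dtype(np.float64).itemsize / gib  # 8 bytes per value
print(f"float32: {f32:.2f} GiB, float64: {f64:.2f} GiB")
# float32: 3.73 GiB, float64: 7.45 GiB
```

Defaulting to float64 exactly doubles the in-memory footprint, which is why an opt-in `values_dtype` keyword is preferable to changing the default.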

Author

juntyr commented Jul 30, 2024

> Hi @juntyr and @dmey, thanks for this change! However, I think this is something we would prefer to be an option since it will double the memory requirement (and we do have GRIB fields with over 1 billion values).

Good point! What about the new version, where the dtype is inferred from the GRIB messages, so that cfgrib just exposes whatever dtype the GRIB file uses?

@iainrussell
Member

Ah, nice idea, but it does not really work. GRIB files can encode their values in a number of ways, including pretty much any number of bits per value (e.g. 3, 11, 18, 24), and these cannot be expressed as numpy arrays.

When we get the array of values from the GRIB file via the ecCodes library, we call an ecCodes function that decodes the file and always returns an array of float64, regardless of what the encoding in the original GRIB message was; then we convert to float32. There is now also a function that returns the values as a float32 array, which is in fact what cfgrib should be calling in the first place!

So, since there usually is no numpy way of expressing the encoding in the GRIB file, the only choice we have is to ask ecCodes for float32 or float64 (it does the dirty work of the conversion); we can then always convert to another numpy representation if needed. (The current behaviour is inefficient because we ask ecCodes for a float64 array and then cast it to float32.)

Author

juntyr commented Jul 30, 2024

> Ah, nice idea, but it does not really work. GRIB files can encode their values in a number of ways including pretty much any number of bits per value (e.g. 3, 11, 18, 24) and these cannot be expressed as numpy arrays. […]

Would there be a way to ask GRIB to return the value in the closest matching numpy dtype (for now just f32 and f64) or to have this functionality inside cfgrib to query the number of bits and choose the closest dtype to request from eccodes based on that?
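As context for this question, the heuristic being asked about might look something like the sketch below, picking the narrowest IEEE float whose significand can hold the packed integers. The function name is hypothetical (not cfgrib or ecCodes API), and the reply that follows explains why this is unreliable in general: `bitsPerValue` alone ignores the reference value, scale factors, and more complex packing types.

```python
import numpy as np

def closest_float_dtype(bits_per_value: int) -> np.dtype:
    """Hypothetical heuristic: map a GRIB bitsPerValue to a numpy float dtype.

    float32 has a 24-bit significand (23 stored bits plus one implicit bit),
    so packed integers of up to 24 bits fit in it without rounding. This
    deliberately ignores the reference value and scale factors applied
    during decoding, which can still cost precision.
    """
    if bits_per_value <= 24:
        return np.dtype(np.float32)
    return np.dtype(np.float64)
```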

@iainrussell
Member

Not so easily, as GRIB files can also encode with things such as JPEG and PNG compression, and there are other more complex packing types. In any case, I really do prefer that the user is in control of the dtype of the resulting array so that they can control the precision and memory usage of the resulting xarray.

Author

juntyr commented Jul 30, 2024

> Not so easily, as GRIB files can also encode with things such as JPEG and PNG compression, and there are other more complex packing types. In any case, I really do prefer that the user is in control of the dtype of the resulting array so that they can control the precision and memory usage of the resulting xarray.

Thanks for your explanation! In that case, I agree that your proposed API would work best, and I can implement it.

In my use case, evaluating different compression strategies, I do want to know what the native precision of the GRIB file is. Is there a better way than loading the data in float64 precision, then casting to float32 and checking whether any precision was lost?
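For reference, the round-trip check described above is straightforward in plain NumPy (the sample values here are made up for illustration):

```python
import numpy as np

def survives_float32(a: np.ndarray) -> bool:
    """True if a float64 array round-trips through float32 without loss."""
    return np.array_equal(a, a.astype(np.float32).astype(np.float64))

exact = np.array([0.5, 1.25, 2048.0])       # exactly representable in float32
inexact = np.array([2147.02, 1.0000001])    # requires more than 24 significand bits

print(survives_float32(exact), survives_float32(inexact))  # True False
```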

@iainrussell
Member

Well, you must have ecCodes installed on your system if you have cfgrib there, so this command-line call can help, if you're feeling strong enough to handle the gory details:

grib_dump -O gfs.t00z.pgrb2.0p25.f000

Look at Section 5 of the output and you will see something like this:

======================   SECTION_5 ( length=49, padding=0 )    ======================
1-4       section5Length = 49
5         numberOfSection = 5
6-9       numberOfValues = 1038240
10-11     dataRepresentationTemplateNumber = 3 [Grid point data - complex packing and spatial differencing (grib2/tables/2/5.0.table) ]
12-15     referenceValue = 2147.02
16-17     binaryScaleFactor = 0
18-19     decimalScaleFactor = 1
20        bitsPerValue = 9
21        typeOfOriginalFieldValues = 0 [Floating point (grib2/tables/2/5.1.table) ]
22        groupSplittingMethodUsed = 1 [General group splitting (grib2/tables/2/5.4.table) ]
23        missingValueManagementUsed = 0 [No explicit missing values included within data values (grib2/tables/2/5.5.table) ]
24-27     primaryMissingValueSubstitute = 1649987994
28-31     secondaryMissingValueSubstitute = 4294967295
32-35     numberOfGroupsOfDataValues = 32875
36        referenceForGroupWidths = 0
37        numberOfBitsUsedForTheGroupWidths = 4
38-41     referenceForGroupLengths = 1
42        lengthIncrementForTheGroupLengths = 1
43-46     trueLengthOfLastGroup = 41
47        numberOfBitsForScaledGroupLengths = 7
48        orderOfSpatialDifferencing = 2 [Second-order spatial differencing (grib2/tables/2/5.6.table) ]
49        numberOfOctetsExtraDescriptors = 2

This is a file downloaded from where you pointed me to. You can see that it uses 9 bits per value, but on top of that it has complex packing with spatial differencing applied. You're probably sorry you asked now :)
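As a worked illustration of the Section 5 numbers above, the GRIB2 packing relation decodes a packed integer X as Y = (R + X·2^E) · 10^(−D). The sketch below applies it to the dumped values; note this ignores the group splitting and spatial differencing the message actually uses, so it only shows the quantization step and nominal value range implied by the parameters:

```python
# Section 5 values from the grib_dump output above
R = 2147.02   # referenceValue
E = 0         # binaryScaleFactor
D = 1         # decimalScaleFactor
bits = 9      # bitsPerValue

step = 2**E / 10**D               # smallest representable increment
x_max = 2**bits - 1               # largest 9-bit packed integer: 511
y_min = R * 10**-D                # decoded value for X = 0
y_max = (R + x_max * 2**E) * 10**-D

print(f"{step} {y_min:.3f} {y_max:.3f}")  # 0.1 214.702 265.802
```

So despite being stored in only 9 bits, each value decodes to a float on a 0.1-wide grid between roughly 214.7 and 265.8, which is why neither float32 nor float64 maps directly onto the file's native encoding.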
