Enhancements to GEMPAK reader(s) #2067

jthielen · 2021-08-30T21:37:25Z

Just documenting some desired enhancements to the GEMPAK io submodule brought up in the original PR, other issues, and side conversations, as well as some ideas I had while working on recent PRs:

Speed up snxarray (snxarray is slow #2062, resolved by Improve some thermo calculation bottlenecks by refactoring how units are handled #2064)
Get grid mapping information in the right place (GEMPAK reader assigns grid mapping information incorrectly #2046, resolved by Store grid mapping on metpy_crs coord rather than DataArray attrs in GEMPAK reader #2066)
Add support for GRIB2 packing reader
- Any way that https://github.com/ecmwf/cfgrib can be leveraged?
Add unit metadata to variables and coordinates in *xarray() methods (gdxarray() should assign units to x and y coordinates #2118, referenced in GEMPAK reader assigns grid mapping information incorrectly #2046)
Include lat/lon auxiliary coordinates on gdxarray() output
Consolidate xarray output where possible (multiple DataArrays into Dataset, multiple Datasets into single Dataset)
Support xarray backend API
- Need to provide subclasses of xarray.backends.BackendEntrypoint (actual backend) and xarray.backends.BackendArray (lazy loaded arrays)
- May require a substantial refactor
Add more CF metadata (to help with declarative / Automatic field calculation #3 / Systematic identification of variables from an xarray Dataset #886)
Clean up (and potentially speed up) interp functions using Option to return full (cumulative) profile from thickness_hydrostatic functions #1312

The text was updated successfully, but these errors were encountered:

sgdecker · 2021-09-22T15:55:03Z

Plenty of GEMPAK files have GRIB2 packing these days, so I'd like to second that list item, which is a particularly high priority for my own process of converting things over to MetPy.

sgdecker · 2021-10-13T13:38:46Z

Add unit metadata to variables and coordinates in *xarray() methods

A variation of this would be to allow pressures with units for the level keyword argument.
For instance:

p = 70000 * units('Pa')
hght = gem_file.gdxarray(parameter='HGHT', level=p)

would return the 700-mb heights rather than producing a TypeError as it does now.

nawendt · 2022-10-31T16:17:08Z

I've become more interested in combining multiple DataArray into a single Dataset, particularly with grids. I've only taken a cursory look at it. My struggle has been how to deal with combine vertical coordinate levels. It seems like you would have to keep track of a variable name and its particular vertical coordinate and then append levels as appropriate. I think (?) this is what is being done with xarray's own netCDF backend. There is logic in there that checks if things are "stackable". I'll need to understand how to apply that in this case to think of a solution. If anyone has any good ways to conceptualize this, let me know. The other challenge with combining DataArray will come from layer data (i.e., when GEMPAK has GLV1 and GLV2 defined for a grid). I am not sure you can combine that information into a single dimension. The current iteration just uses the first level information. You might need to have grids with another dimension that contains the second layer level.

sgdecker · 2022-11-02T19:43:20Z

It seems like you would have to keep track of a variable name and its particular vertical coordinate and then append levels as appropriate.

I think this is the crux of the required functionality, but with the extra complication that some variables are only provided on a subset of levels. For instance, you might have geopotential height every 50 mb, but absolute vorticity only at 850, 700, 500, and 300 mb. xarray's backend thus considers geopotential height and absolute vorticity to have two distinct vertical coordinates, and this leads to the explosion of coordinates that you see in, say, the xarray Dataset that corresponds to GFS model output pulled from a TDS. To avoid the coordinate explosion, I suppose the alternative would be to set the unprovided levels (e.g., the 800-mb absolute vorticity in this example) to all NaN or some other "missing" value. I'm not sure if xarray is smart enough to use one byte to store the NaN, or if that would waste a bunch of memory (a separate NaN stored for each grid point at that level).

dopplershift · 2022-11-02T20:36:22Z

To avoid the coordinate explosion, I suppose the alternative would be to set the unprovided levels (e.g., the 800-mb absolute vorticity in this example) to all NaN or some other "missing" value. I'm not sure if xarray is smart enough to use one byte to store the NaN, or if that would waste a bunch of memory (a separate NaN stored for each grid point at that level).

There are probably ways using lazy arrays to leave these missing chunks uninstantiated until they're needed (e.g. for plotting or calculations). Having said that, this would introduce a bunch of NaNs in e.g. profiles, plus it implies that data are available for these levels, only to be left missing. Dealing with this in code leaves a bunch of complications dealing with missing data. I'd personally opt for just having multiple coordinates and letting e.g. data.metpy.sel(vertical=500 * units.hPa) handle dealing with it.

nawendt · 2023-02-14T18:23:28Z

Plenty of GEMPAK files have GRIB2 packing these days, so I'd like to second that list item, which is a particularly high priority for my own process of converting things over to MetPy.

I was taking a closer look at decoding GRIB2 packing recently. It's definitely more complicated than the others to implement. That being said, it seems doable. GEMPAK has what looks like a fully functional GRIB2 decoder. It's possible that it could be pared down a bit if all we care about is just unpacking the data. The GEMPAK decoder also handles PNG and JPEG compressed data. I don't work with GRIB2 packed GEMPAK files, so I do not know how often those appear in the wild. Dealing with those compressions would introduce another layer of complication and perhaps some additional dependencies.

dopplershift · 2023-02-14T19:08:03Z

Dealing with PNG/JPEG compression (assuming it's not JPEG2k) should be doable using Pillow (which is already in our dependency chain thanks to Matplotlib. The hard part here really isn't decoding the grid, but interpreting all the other bits and bytes that amount to metadata.

I've got an old start on something to walk through GRIB messages:

log = logging.getLogger("metpy.io.grib")
#log.setLevel(logging.WARNING)


def _make_datetime(s):
    s = bytearray(s)
    year = (s[0] << 8) | s[1]
    return datetime(year, *s[2:])

section0 = NamedStruct([('grib', '4s'), ('reserved', 'H'), ('discipline', 'B'),
                        ('edition', 'B'), ('total_length', 'Q')], '>', 'Section0')

section_header =  NamedStruct([('section_len', 'I'), ('section_num', 'B')], '>', 'section_head')

class Descriptor(object):
    def __init__(self, fxy):
        pass

    def read(self, buf):
        pass

    
class Element(Descriptor):
    def __init__(self, fxy):
        self._id = fxy
        self._info = bufr_table_b[fxy]
        self.bits = int(self._info['BUFR_DataWidth_Bits'])

    def read(self, src):
        return src.read(self.bits)


class Replication(Descriptor):
    def __init__(self, descs, repeats):
        pass

    def read(self, src):
        return src.read(self.bits)

    
class Operator(Descriptor):
    def __init__(self, fxy):
        pass

    
class Sequence(Descriptor):
    def __init__(self, fxy):
        self._id = fxy
        self._info = bufr_table_d[fxy]
        self.bits = [int(bufr_table_b[int(i['FXY2'])]['BUFR_DataWidth_Bits']) for i in self._info]

    def read(self, src):
        return [src.read(b) for b in self.bits]

    
class GRIBMessage(object):
    def __init__(self, fobj):
        hdr = section0.unpack_file(fobj)
#        print(hdr)
        if hdr.grib != b'GRIB':
            logging.error('Bad section 0: %s', hdr)
            raise RuntimeError('Bad section 0: %s' % str(hdr))

        self._buffer = IOBuffer(fobj.read(hdr.total_length - section0.size))
        self.sections = [hdr]
        
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        end_bytes = self._buffer.read(4)
        if end_bytes != b'7777':
            logging.warning('Last bytes not correct: %s', end_bytes)

    def _read_section(self, jump_end=True):
        sec_start = self._buffer.set_mark()
        sec_hdr = self._buffer.read_struct(section_header)
        
        sec = getattr(self, '_read_section{:d}'.format(sec_hdr.section_num))()
        self.sections.append(sec)
        if jump_end:
            self._buffer.jump_to(sec_start, sec_hdr.section_len)
        return sec

    section1 = NamedStruct([('center', 'H'),
                            ('subcenter', 'H'), ('master_table_version', 'B'), ('local_table_version', 'B'),
                            ('ref_time_significance', 'B'), ('dt', '7s', _make_datetime),
                            ('production_status', 'B'), ('data_type', 'B')], '>', 'Section1')
    #                        ('template_id', 'H')], '>', 'Section1')
    def _read_section1(self):
        return self._buffer.read_struct(self.section1)

    def _read_section2(self):
        raise NotImplementedError('Local section not handled')

    section3 = NamedStruct([('grid_def_source', 'B'), ('num_data_points', 'I'), ('num_octets', 'B'),
                            ('interp', 'B'), ('template_num', 'H')], '>', 'Section3')

    def _read_section3(self):
        return self._buffer.read_struct(self.section3)
    
    section4 = NamedStruct([('num_values', 'H'), ('prod_template', 'H')], '>', 'Section4')
    def _read_section4(self):
        return self._buffer.read_struct(self.section4)

    section5 = NamedStruct([('num_values', 'I'), ('data_template', 'H')], '>', 'Section5')
    def _read_section5(self):
        return self._buffer.read_struct(self.section5)

    section6 = NamedStruct([('indicator', 'B')], '>', 'Section6')
    def _read_section6(self):
        return self._buffer.read_struct(self.section6)

    def _read_section7(self):
        self._data_ptr = self._buffer._offset

nawendt · 2023-02-14T20:02:06Z

It does appear to be JPEG2k.

dopplershift · 2023-02-14T20:58:13Z

I kinda figured since that's why we were dealing with the JJ2000 library over on the netcdf-java side. Looks like Pillow can support jpeg2k, not sure how regularly it's included in builds. We can cross that bridge when we get to it I guess.

jthielen added Type: Enhancement Enhancement to existing functionality Area: IO Pertains to reading data labels Aug 30, 2021

nawendt mentioned this issue Jan 27, 2023

Fix bug in GempakGrid #2904

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancements to GEMPAK reader(s) #2067

Enhancements to GEMPAK reader(s) #2067

jthielen commented Aug 30, 2021 •

edited

Loading

sgdecker commented Sep 22, 2021

sgdecker commented Oct 13, 2021

nawendt commented Oct 31, 2022

sgdecker commented Nov 2, 2022

dopplershift commented Nov 2, 2022

nawendt commented Feb 14, 2023

dopplershift commented Feb 14, 2023

nawendt commented Feb 14, 2023

dopplershift commented Feb 14, 2023

Enhancements to GEMPAK reader(s) #2067

Enhancements to GEMPAK reader(s) #2067

Comments

jthielen commented Aug 30, 2021 • edited Loading

sgdecker commented Sep 22, 2021

sgdecker commented Oct 13, 2021

nawendt commented Oct 31, 2022

sgdecker commented Nov 2, 2022

dopplershift commented Nov 2, 2022

nawendt commented Feb 14, 2023

dopplershift commented Feb 14, 2023

nawendt commented Feb 14, 2023

dopplershift commented Feb 14, 2023

jthielen commented Aug 30, 2021 •

edited

Loading