
Enhancements to GEMPAK reader(s) #2067

Open · 2 of 9 tasks
jthielen opened this issue Aug 30, 2021 · 9 comments

Labels: Area: IO (Pertains to reading data), Type: Enhancement (Enhancement to existing functionality)

Comments

@jthielen (Collaborator) commented Aug 30, 2021

Just documenting some desired enhancements to the GEMPAK io submodule brought up in the original PR, other issues, and side conversations, as well as some ideas I had while working on recent PRs:

@sgdecker (Contributor)

Plenty of GEMPAK files have GRIB2 packing these days, so I'd like to second that list item, which is a particularly high priority for my own process of converting things over to MetPy.

@sgdecker (Contributor)

> Add unit metadata to variables and coordinates in *xarray() methods

A variation of this would be to allow pressures with units for the level keyword argument.
For instance:

```python
p = 70000 * units('Pa')
hght = gem_file.gdxarray(parameter='HGHT', level=p)
```

would return the 700-mb heights rather than producing a TypeError as it does now.
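
For discussion, a minimal sketch of what the unit handling might look like inside the reader (the helper name and the assumption that GEMPAK levels are stored natively in hPa are mine):

```python
from metpy.units import units


def _normalize_level(level):
    """Coerce a level argument to GEMPAK's native value (hypothetical helper)."""
    if hasattr(level, 'units'):  # a pint Quantity was passed
        return level.to('hPa').magnitude
    return level  # bare numbers are assumed to already be in native units


# Both calls would then match the same 700-hPa grid:
# gem_file.gdxarray(parameter='HGHT', level=700)
# gem_file.gdxarray(parameter='HGHT', level=70000 * units('Pa'))
```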

@nawendt (Contributor) commented Oct 31, 2022

I've become more interested in combining multiple DataArrays into a single Dataset, particularly with grids. I've only taken a cursory look at it so far. My struggle has been how to deal with combining vertical coordinate levels. It seems like you would have to keep track of a variable name and its particular vertical coordinate and then append levels as appropriate. I think (?) this is what xarray's own netCDF backend does; there is logic in there that checks whether things are "stackable". I'll need to understand how to apply that here before I can think of a solution. If anyone has a good way to conceptualize this, let me know.

The other challenge with combining DataArrays will come from layer data (i.e., when GEMPAK has both GLV1 and GLV2 defined for a grid). I am not sure you can combine that information into a single dimension. The current iteration just uses the first level's information; you might need to give such grids another dimension that contains the second layer level.
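
For discussion's sake, here is one way the bookkeeping could be conceptualized (a sketch only; the helper name, the 'level' dimension, and the assumption that each DataArray arrives with a scalar level coordinate are all mine):

```python
import xarray as xr


def combine_levels(arrays):
    """Group single-level DataArrays by name, then stack along the vertical."""
    grouped = {}
    for da in arrays:
        grouped.setdefault(da.name, []).append(da)

    stacked = []
    for das in grouped.values():
        das.sort(key=lambda da: float(da.level))
        # The scalar level coordinate becomes a real dimension per variable
        stacked.append(xr.concat(das, dim='level'))

    # merge's default outer join aligns differing level ladders,
    # filling the gaps with NaN (the trade-off discussed below)
    return xr.merge(stacked)
```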

@sgdecker (Contributor) commented Nov 2, 2022

> It seems like you would have to keep track of a variable name and its particular vertical coordinate and then append levels as appropriate.

I think this is the crux of the required functionality, but with the extra complication that some variables are only provided on a subset of levels. For instance, you might have geopotential height every 50 mb, but absolute vorticity only at 850, 700, 500, and 300 mb. xarray's backend thus considers geopotential height and absolute vorticity to have two distinct vertical coordinates, and this leads to the explosion of coordinates that you see in, say, the xarray Dataset that corresponds to GFS model output pulled from a TDS. To avoid the coordinate explosion, I suppose the alternative would be to set the unprovided levels (e.g., the 800-mb absolute vorticity in this example) to all NaN or some other "missing" value. I'm not sure if xarray is smart enough to use one byte to store the NaN, or if that would waste a bunch of memory (a separate NaN stored for each grid point at that level).
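
For what it's worth, the NaN-fill mechanics are easy with xarray's reindex, but with a numpy-backed array each unprovided level really is stored as a full slab of float NaNs rather than a single byte (toy example; all names made up):

```python
import numpy as np
import xarray as xr

all_levels = np.arange(100., 1001., 50.)          # full ladder, hPa
avor_levels = np.array([300., 500., 700., 850.])  # levels that actually have data

# Toy absolute-vorticity field defined on only four levels
avor = xr.DataArray(np.random.rand(4, 3, 3),
                    dims=('isobaric', 'y', 'x'),
                    coords={'isobaric': avor_levels})

# Unprovided levels (e.g., 800 hPa) become all-NaN 2D slabs
avor_full = avor.reindex(isobaric=all_levels)
```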

@dopplershift (Member)

> To avoid the coordinate explosion, I suppose the alternative would be to set the unprovided levels (e.g., the 800-mb absolute vorticity in this example) to all NaN or some other "missing" value. I'm not sure if xarray is smart enough to use one byte to store the NaN, or if that would waste a bunch of memory (a separate NaN stored for each grid point at that level).

There are probably ways, using lazy arrays, to leave these missing chunks uninstantiated until they're needed (e.g., for plotting or calculations). Having said that, this would introduce a bunch of NaNs in, say, profiles, and it implies that data are available for those levels only for them to turn up missing. Handling that in code brings its own pile of missing-data complications. I'd personally opt for just having multiple coordinates and letting e.g. data.metpy.sel(vertical=500 * units.hPa) deal with it.
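
For reference, the accessor-based selection mentioned above would look like this in practice (the file and variable names here are placeholders; the data would need to be parsed with parse_cf so MetPy can identify the vertical coordinate):

```python
import xarray as xr
from metpy.units import units

ds = xr.open_dataset('gfs.nc')  # placeholder file
hght = ds.metpy.parse_cf('Geopotential_height_isobaric')

# Unit-aware selection on the vertical axis, whatever units the file used
h500 = hght.metpy.sel(vertical=500 * units.hPa)
```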

@nawendt (Contributor) commented Feb 14, 2023

> Plenty of GEMPAK files have GRIB2 packing these days, so I'd like to second that list item, which is a particularly high priority for my own process of converting things over to MetPy.

I was taking a closer look at decoding GRIB2 packing recently. It's definitely more complicated than the other packing schemes to implement, but it seems doable. GEMPAK has what looks like a fully functional GRIB2 decoder, and it could probably be pared down a bit if all we care about is unpacking the data. The GEMPAK decoder also handles PNG- and JPEG-compressed data. I don't work with GRIB2-packed GEMPAK files, so I don't know how often those appear in the wild; dealing with those compressions would introduce another layer of complication and perhaps some additional dependencies.

@dopplershift (Member)

Dealing with PNG/JPEG compression (assuming it's not JPEG2k) should be doable using Pillow, which is already in our dependency chain thanks to Matplotlib. The hard part here really isn't decoding the grid, but interpreting all the other bits and bytes that amount to metadata.
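
A rough sketch of the Pillow piece, assuming we've already isolated the raw Section 7 payload of a PNG-packed (template 5.41) field; turning the decoded integers back into physical values would still need the reference value and scale factors from Section 5:

```python
from io import BytesIO

import numpy as np
from PIL import Image


def unpack_png_payload(payload):
    """Decode PNG-packed GRIB2 data into its packed integer values (sketch)."""
    with Image.open(BytesIO(payload)) as img:
        return np.asarray(img)
```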

I've got an old start on something to walk through GRIB messages:

```python
import logging
from datetime import datetime

# NamedStruct and IOBuffer are MetPy's internal binary-IO helpers
from metpy.io._tools import IOBuffer, NamedStruct

log = logging.getLogger('metpy.io.grib')


def _make_datetime(s):
    """Build a datetime from the 7-byte timestamp in GRIB2 Section 1."""
    s = bytearray(s)
    year = (s[0] << 8) | s[1]  # 2-byte big-endian year
    return datetime(year, *s[2:])

section0 = NamedStruct([('grib', '4s'), ('reserved', 'H'), ('discipline', 'B'),
                        ('edition', 'B'), ('total_length', 'Q')], '>', 'Section0')

section_header = NamedStruct([('section_len', 'I'), ('section_num', 'B')], '>', 'section_head')

# Note: the descriptor classes below are stubs carried over from a BUFR
# effort; they assume BUFR tables B and D (bufr_table_b, bufr_table_d)
# have been loaded elsewhere.
class Descriptor:
    def __init__(self, fxy):
        pass

    def read(self, buf):
        pass


class Element(Descriptor):
    def __init__(self, fxy):
        self._id = fxy
        self._info = bufr_table_b[fxy]
        self.bits = int(self._info['BUFR_DataWidth_Bits'])

    def read(self, src):
        return src.read(self.bits)


class Replication(Descriptor):
    def __init__(self, descs, repeats):
        # Unfinished: the original sketch never set self.bits here
        self.descs = descs
        self.repeats = repeats

    def read(self, src):
        raise NotImplementedError('Replication not yet handled')


class Operator(Descriptor):
    def __init__(self, fxy):
        pass


class Sequence(Descriptor):
    def __init__(self, fxy):
        self._id = fxy
        self._info = bufr_table_d[fxy]
        self.bits = [int(bufr_table_b[int(i['FXY2'])]['BUFR_DataWidth_Bits'])
                     for i in self._info]

    def read(self, src):
        return [src.read(b) for b in self.bits]
    
class GRIBMessage:
    def __init__(self, fobj):
        hdr = section0.unpack_file(fobj)
        if hdr.grib != b'GRIB':
            log.error('Bad section 0: %s', hdr)
            raise RuntimeError(f'Bad section 0: {hdr}')

        # Buffer everything past Section 0 for this message
        self._buffer = IOBuffer(fobj.read(hdr.total_length - section0.size))
        self.sections = [hdr]

        # Walk the six sections the original sketch read back to back
        for _ in range(6):
            self._read_section()

        end_bytes = self._buffer.read(4)
        if end_bytes != b'7777':
            log.warning('Last bytes not correct: %s', end_bytes)

    def _read_section(self, jump_end=True):
        sec_start = self._buffer.set_mark()
        sec_hdr = self._buffer.read_struct(section_header)

        # Dispatch to the handler for this section number
        sec = getattr(self, f'_read_section{sec_hdr.section_num:d}')()
        self.sections.append(sec)
        if jump_end:
            # Skip past any bytes of the section we didn't parse
            self._buffer.jump_to(sec_start, sec_hdr.section_len)
        return sec

    section1 = NamedStruct([('center', 'H'), ('subcenter', 'H'),
                            ('master_table_version', 'B'), ('local_table_version', 'B'),
                            ('ref_time_significance', 'B'), ('dt', '7s', _make_datetime),
                            ('production_status', 'B'), ('data_type', 'B')], '>', 'Section1')

    def _read_section1(self):
        return self._buffer.read_struct(self.section1)

    def _read_section2(self):
        raise NotImplementedError('Local section not handled')

    section3 = NamedStruct([('grid_def_source', 'B'), ('num_data_points', 'I'), ('num_octets', 'B'),
                            ('interp', 'B'), ('template_num', 'H')], '>', 'Section3')

    def _read_section3(self):
        return self._buffer.read_struct(self.section3)

    section4 = NamedStruct([('num_values', 'H'), ('prod_template', 'H')], '>', 'Section4')

    def _read_section4(self):
        return self._buffer.read_struct(self.section4)

    section5 = NamedStruct([('num_values', 'I'), ('data_template', 'H')], '>', 'Section5')

    def _read_section5(self):
        return self._buffer.read_struct(self.section5)

    section6 = NamedStruct([('indicator', 'B')], '>', 'Section6')

    def _read_section6(self):
        return self._buffer.read_struct(self.section6)

    def _read_section7(self):
        # Just remember where the packed data begins; actual unpacking is TBD
        self._data_ptr = self._buffer._offset
```
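
For completeness, the walker above would be exercised roughly like this (the file name is a placeholder):

```python
with open('example.grib2', 'rb') as f:
    msg = GRIBMessage(f)
    print(msg.sections[1].dt)  # reference time parsed from Section 1
```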

@nawendt (Contributor) commented Feb 14, 2023

It does appear to be JPEG2k.

@dopplershift (Member)

I kinda figured, since that's why we were dealing with the JJ2000 library over on the netcdf-java side. It looks like Pillow can support JPEG2000, though I'm not sure how regularly it's included in builds. We can cross that bridge when we get to it, I guess.
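
If we do lean on Pillow for JPEG2000, its feature API should be able to report at runtime whether a given build was compiled against OpenJPEG:

```python
from PIL import features

# True only if this Pillow build includes the OpenJPEG (JPEG2000) codec
print(features.check('jpg_2000'))
```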
