Parsing local and cloud SEG-Y files with new I/O library (#381)

* Refactor text_header setter type check
  Removed the list type check from the text_header setter in accessor.py. The application now expects a string input instead of a list, simplifying the validation process.
* Refactor workers for SEG-Y parsing
  Simplify `header_scan_worker` and `trace_worker` in SEG-Y module by removing unused imports and streamlining parameter list. Update functions to work directly with `SegyFile` instances and clean up data handling logic for efficiency.
* Refactor SEG-Y parser and streamline imports
  Refactor the parsing functions in `src/mdio/segy/parsers.py` to simplify the codebase and improve maintainability. Redundant functions such as `parse_binary_header`, `parse_text_header`, and `get_trace_count` have been removed, while imports have been condensed to only essential modules. The `NUM_CORES` logic is updated to count logical cores instead of just physical ones.
* Refactor SEG-Y converter and simplify imports
  Removed unused imports and functions in the SEG-Y converter module to enhance code maintainability. Simplified the arguments for the `segy_to_mdio` function to increase ease of use and readability. Reduced complexity by utilizing `SegyFile` class for SEG-Y file operations.
* Refactor get_grid_plan and remove unused imports
  The get_grid_plan function in utilities.py has been refactored to accept a SegyFile instance instead of individual parameters for the file path. Unused imports were eliminated, and type checking imports are now conditional, improving readability and modularity.
* use NDArray typing since we now return struct
* Refactor to use 'segy' instead of 'segyio'.
  The changes involve major refactoring of the code base to use the 'segy' library instead of 'segyio'. Most notably, this included updating the handling of SEG-Y dtypes, byte order, and trace headers. Unused imports have been removed to clean up the code. A new multiprocessing chunk size has been introduced and set attributes to SegyFile instance instead of passing them as function arguments.
* refactor override tests to use ndarray headers instead of a dictionary to make it work with 'segy'.
* Remove unit tests for IBM/IEEE conversions and text headers
* Refactor and simplify 6D tests related to SEG-Y
* Upgrade segy package version
  The segy package version has been updated from 0.0.13 to 0.0.14 in the pyproject.toml file. This upgrade was performed to update software dependencies and to integrate the latest bug fixes and features delivered with the new version.
* Refactored segy factory creation in mdio_to_segy function
  A new helper function, 'make_segy_factory', has been created to handle the generation of SegyFactory. This function accepts more parameters to provide better control over the creation of the SEG-Y based on the MDIO metadata. Changes also include updates in import declarations and reorganization of some code blocks in the 'mdio_spec_to_segy' function.
* Update segy library version
* Multiply sample_interval by 1000 in SegyFactory
  In the SegyFactory initialization within creation.py, the sample_interval parameter has been modified to be multiplied by 1000. This change ensures that the value is correctly represented in microseconds, aligning with the expected data format.
* fix docstring errors
* Update dependency package versions
* update field name for segy data
* import Endianness from new location
* use bleeding edge segy during dev
* allow configuring endianness on export
* update binary header
* Update the 'segy' git repository link
* Update virtualenv version in constraints.txt
* Update poetry version in workflow constraints
* update RtD dependencies
* switch myst-nb to stable
* fix broken tests
* simplify factory usage and fix tests
* add original segyio fields as spec
* Add pytest-dependency to project dev dependencies
* fix: headers were missed due to early return
* streamline mdio segy spec
* simplify mock 4d generation
* enforce mdio segy spec
* update type hints to the correct segy type.
* revert api
* update type hints
* remove endian from segy import because its inferred
* remove output format from seg-y export. we only export as its set in "binary header"
* update endian kwarg name
* revert to old api
* enable all tests
* Update get_grid_plan
* Remove unused byte swapping function from segy creation module.
* Remove now unused byte utils module
* Add temporary safety check ignore for specific CVE
  The safety check in noxfile.py has been updated to temporarily ignore a specific Common Vulnerabilities and Exposures (CVE) number because it's not deemed critical. A TODO note is added to remind removal of this exception once the issue is resolved.
* fix safety ignore syntax
* make temp zarr files module scoped
* revert to_segy endian api
* simplify changes
* Correct variable in default chunk selection
* Update segy package version in pyproject.toml
* use correct spec for factory
* use new endian inference from `segy`
* get header dtype from spec instead of reading a header
* remove unnecessary cast
* remove commented line
* Implement dynamic CPU count for header parsing
* backward_compat: revert text header to write as list[str] instead of str with newline
* generate spec as needed and avoid singleton bugs
* bump version
* add missing return doc
* Add future annotations import for type hints
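
The `sample_interval` change listed above is a unit conversion: the SEG-Y binary header stores the sample interval in microseconds, and the commit message implies the value on the MDIO side is in milliseconds. A minimal sketch of that conversion, assuming a millisecond input; the helper name `to_microseconds` is hypothetical and is not part of `mdio` or `segy`:

```python
def to_microseconds(sample_interval_ms: float) -> int:
    """Convert a sample interval from milliseconds to microseconds."""
    # round() keeps the result an integer even when the float product
    # is not exactly representable.
    return round(sample_interval_ms * 1000)


print(to_microseconds(4))    # 4000
print(to_microseconds(0.5))  # 500
```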
TGSAI · Jun 25, 2024 · f4a5ad4
1 parent 6bb1956
commit f4a5ad4
Showing 29 changed files with 1,577 additions and 2,306 deletions.
diff --git a/.github/workflows/constraints-poetry.txt b/.github/workflows/constraints-poetry.txt
@@ -1 +1 @@
-poetry==1.8.2
+poetry==1.8.3
diff --git a/.github/workflows/constraints.txt b/.github/workflows/constraints.txt
@@ -1,4 +1,4 @@
 pip==24.0
 nox==2024.4.15
 nox-poetry==1.0.3
-virtualenv==20.26.1
+virtualenv==20.26.2
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -1,7 +1,6 @@
 furo==2024.5.6
 sphinx==7.3.7
-sphinx-click==5.1.0
+sphinx-click==6.0.0
 sphinx-copybutton==0.5.2
-# myst-nb==0.17.2
-myst-nb @ git+https://github.com/executablebooks/MyST-NB@35ebd54
+myst-nb==1.1.0
 linkify-it-py==2.0.3
diff --git a/noxfile.py b/noxfile.py
@@ -144,7 +144,15 @@ def safety(session: Session) -> None:
     """Scan dependencies for insecure packages."""
     requirements = session.poetry.export_requirements()
     session.install("safety")
-    session.run("safety", "check", "--full-report", f"--file={requirements}")
+    # TODO(Altay): Remove the CVE ignore once it's resolved. It's not critical, so ignoring now.
+    ignore = ["70612"]
+    session.run(
+        "safety",
+        "check",
+        "--full-report",
+        f"--file={requirements}",
+        f"--ignore={','.join(ignore)}",
+    )
 
 
 @session(python=python_versions)
@@ -219,9 +227,7 @@ def docs_build(session: Session) -> None:
         "sphinx-click",
         "sphinx-copybutton",
         "furo",
-        # TODO(Altay): Update this to v1.0.0 when its out. Right now we
-        #  use this because myst-nb stable doesn't work with Sphinx 7.
-        "myst-nb@git+https://github.com/executablebooks/MyST-NB@35ebd54",
+        "myst-nb",
         "linkify-it-py",
     )
 
@@ -243,9 +249,7 @@ def docs(session: Session) -> None:
         "sphinx-click",
         "sphinx-copybutton",
         "furo",
-        # TODO(Altay): Update this to v1.0.0 when its out. Right now we
-        #  use this because myst-nb stable doesn't work with Sphinx 7.
-        "myst-nb@git+https://github.com/executablebooks/MyST-NB@35ebd54",
+        "myst-nb",
         "linkify-it-py",
     )
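
The `safety` hunk above passes the ignored advisory IDs as a single comma-separated `--ignore` value. A small sketch of how that argument list comes together outside of nox; `build_safety_args` and the `requirements.txt` path are illustrative names, not part of the noxfile:

```python
def build_safety_args(requirements: str, ignore: list[str]) -> list[str]:
    """Assemble the argument list for a `safety check` run."""
    args = ["safety", "check", "--full-report", f"--file={requirements}"]
    if ignore:
        # Several IDs collapse into one comma-separated flag value.
        args.append(f"--ignore={','.join(ignore)}")
    return args


print(build_safety_args("requirements.txt", ["70612"]))
```

Keeping the IDs in a list, as the diff does, makes it a one-line change to add or drop an exception later.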
 

diff --git a/poetry.lock b/poetry.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "multidimio"
-version = "0.7.4"
+version = "0.8.0"
 description = "Cloud-native, scalable, and user-friendly multi dimensional energy data!"
 authors = ["TGS <[email protected]>"]
 maintainers = [
@@ -26,22 +26,21 @@ Changelog = "https://github.com/TGSAI/mdio-python/releases"
 python = ">=3.9,<3.13"
 click = "^8.1.7"
 click-params = "^0.5.0"
-zarr = "^2.16.1"
-dask = ">=2023.10.0"
-tqdm = "^4.66.1"
-segyio = "^1.9.3"
-numba = "^0.59.1"
-psutil = "^5.9.5"
-fsspec = ">=2023.9.1"
+zarr = "^2.18.2"
+dask = ">=2024.6.1"
+tqdm = "^4.66.4"
+psutil = "^6.0.0"
+fsspec = ">=2024.6.0"
+segy = "^0.1.4"
 rich = "^13.7.1"
 urllib3 = "^1.26.18" # Workaround for poetry-plugin-export/issues/183
 
 # Extras
-distributed = {version = ">=2023.10.0", optional = true}
-bokeh = {version = "^3.2.2", optional = true}
-s3fs = {version = ">=2023.5.0", optional = true}
-gcsfs = {version = ">=2023.5.0", optional = true}
-adlfs = {version = ">=2023.4.0", optional = true}
+distributed = {version = ">=2024.6.1", optional = true}
+bokeh = {version = "^3.4.1", optional = true}
+s3fs = {version = ">=2024.6.0", optional = true}
+gcsfs = {version = ">=2024.6.0", optional = true}
+adlfs = {version = ">=2024.4.1", optional = true}
 zfpy = {version = "^0.5.5", optional = true}
 
 [tool.poetry.extras]
@@ -51,30 +50,31 @@ lossy = ["zfpy"]
 
 [tool.poetry.group.dev.dependencies]
 black = "^24.4.2"
-coverage = {version = "^7.4.0", extras = ["toml"]}
+coverage = {version = "^7.5.3", extras = ["toml"]}
 darglint = "^1.8.1"
-flake8 = "^7.0.0"
+flake8 = "^7.1.0"
 flake8-bandit = "^4.1.1"
-flake8-bugbear = "^23.12.2"
+flake8-bugbear = "^24.4.26"
 flake8-docstrings = "^1.7.0"
 flake8-rst-docstrings = "^0.3.0"
-furo = ">=2023.9.10"
+furo = ">=2024.5.6"
 isort = "^5.13.2"
-mypy = "^1.8.0"
-pep8-naming = "^0.13.3"
-pre-commit = "^3.6.0"
-pre-commit-hooks = "^4.5.0"
-pytest = "^7.4.4"
-pyupgrade = "^3.15.0"
-safety = "^2.3.5"
-sphinx-autobuild = "^2021.3.14"
-sphinx-click = "^5.1.0"
+mypy = "^1.10.0"
+pep8-naming = "^0.14.1"
+pre-commit = "^3.7.1"
+pre-commit-hooks = "^4.6.0"
+pytest = "^8.2.2"
+pytest-dependency = "^0.6.0"
+pyupgrade = "^3.16.0"
+safety = "^3.2.3"
+sphinx-autobuild = ">=2024.4.16"
+sphinx-click = "^6.0.0"
 sphinx-copybutton = "^0.5.2"
-typeguard = "^4.1.5"
-xdoctest = {version = "^1.1.2", extras = ["colors"]}
-myst-parser = "^2.0.0"
-Pygments = "^2.17.2"
-Sphinx = "^7.2.6"
+typeguard = "^4.3.0"
+xdoctest = {version = "^1.1.5", extras = ["colors"]}
+myst-parser = "^3.0.1"
+Pygments = "^2.18.0"
+Sphinx = "^7.3.7"
 
 [tool.poetry.scripts]
 mdio = "mdio.__main__:main"

diff --git a/src/mdio/commands/segy.py b/src/mdio/commands/segy.py
@@ -96,16 +96,6 @@
     help="Custom chunk size for bricked storage",
     type=IntListParamType(),
 )
-@option(
-    "-endian",
-    "--endian",
-    required=False,
-    default="big",
-    help="Endianness of the SEG-Y file",
-    type=Choice(["little", "big"]),
-    show_default=True,
-    show_choices=True,
-)
 @option(
     "-lossless",
     "--lossless",
@@ -152,7 +142,6 @@ def segy_import(
     header_types: list[str],
     header_names: list[str],
     chunk_size: list[int],
-    endian: str,
     lossless: bool,
     compression_tolerance: float,
     storage_options: dict[str, Any],
@@ -356,7 +345,6 @@ def segy_import(
         index_types=header_types,
         index_names=header_names,
         chunksize=chunk_size,
-        endian=endian,
         lossless=lossless,
         compression_tolerance=compression_tolerance,
         storage_options=storage_options,
@@ -377,16 +365,6 @@ def segy_import(
     type=STRING,
     show_default=True,
 )
-@option(
-    "-format",
-    "--segy-format",
-    required=False,
-    default="ibm32",
-    help="SEG-Y sample format",
-    type=Choice(["ibm32", "ieee32"]),
-    show_default=True,
-    show_choices=True,
-)
 @option(
     "-storage",
     "--storage-options",
@@ -408,7 +386,6 @@ def segy_export(
     mdio_file: str,
     segy_path: str,
     access_pattern: str,
-    segy_format: str,
     storage_options: dict[str, Any],
     endian: str,
 ):
@@ -438,7 +415,6 @@ def segy_export(
         mdio_path_or_buffer=mdio_file,
         output_segy_path=segy_path,
         access_pattern=access_pattern,
-        out_sample_format=segy_format,
         storage_options=storage_options,
         endian=endian,
     )
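
With the `--endian` import option removed, byte order has to be worked out from the file itself. The diff does not show how the `segy` library does this, so the following is only a sketch of one common heuristic and not the library's method: read the data sample format code (bytes 3225-3226 of the file, inside the binary header) in both byte orders and keep the order that yields a valid code. `infer_endianness`, `VALID_FORMAT_CODES`, and `FORMAT_CODE_OFFSET` are hypothetical names.

```python
import struct

# Sample format codes defined by SEG-Y rev 2 (assumed set for this sketch).
VALID_FORMAT_CODES = set(range(1, 13)) | {15, 16}
# 3200-byte text header, then 24 bytes into the 400-byte binary header.
FORMAT_CODE_OFFSET = 3200 + 24


def infer_endianness(header_bytes: bytes) -> str:
    """Guess byte order from the first 3600 bytes of a SEG-Y file."""
    raw = header_bytes[FORMAT_CODE_OFFSET : FORMAT_CODE_OFFSET + 2]
    if len(raw) != 2:
        raise ValueError("need at least 3226 bytes of header data")

    # Interpret the same two bytes in both byte orders.
    (as_big,) = struct.unpack(">H", raw)
    (as_little,) = struct.unpack("<H", raw)

    if as_big in VALID_FORMAT_CODES:
        return "big"
    if as_little in VALID_FORMAT_CODES:
        return "little"
    raise ValueError("sample format code is invalid in both byte orders")


# Synthetic header: format code 1 (IBM float) written big-endian.
header = bytearray(3600)
header[FORMAT_CODE_OFFSET : FORMAT_CODE_OFFSET + 2] = struct.pack(">H", 1)
print(infer_endianness(bytes(header)))  # big
```

Because every valid code fits in one byte, the two interpretations can never both be valid, which is what makes this field a convenient probe.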
diff --git a/src/mdio/converters/mdio.py b/src/mdio/converters/mdio.py
@@ -12,8 +12,6 @@
 
 from mdio import MDIOReader
 from mdio.segy.blocked_io import to_segy
-from mdio.segy.byte_utils import ByteOrder
-from mdio.segy.byte_utils import Dtype
 from mdio.segy.creation import concat_files
 from mdio.segy.creation import mdio_spec_to_segy
 from mdio.segy.utilities import segy_export_rechunker
@@ -34,7 +32,6 @@ def mdio_to_segy(  # noqa: C901
     output_segy_path: str,
     endian: str = "big",
     access_pattern: str = "012",
-    out_sample_format: str = "ibm32",
     storage_options: dict = None,
     new_chunks: tuple[int, ...] = None,
     selection_mask: np.ndarray = None,
@@ -65,8 +62,6 @@ def mdio_to_segy(  # noqa: C901
             endian. Default is 'big'.
         access_pattern: This specifies the chunk access pattern. Underlying
             zarr.Array must exist. Examples: '012', '01'
-        out_sample_format: Output sample format.
-            Currently support: {'ibm32', 'float32'}. Default is 'ibm32'.
         storage_options: Storage options for the cloud storage backend.
             Default: None (will assume anonymous access)
         new_chunks: Set manual chunksize. For development purposes only.
@@ -99,7 +94,6 @@ def mdio_to_segy(  # noqa: C901
         ...     mdio_path_or_buffer="prefix2/file.mdio",
         ...     output_segy_path="prefix/file.segy",
         ...     selection_mask=boolean_mask,
-        ...     out_sample_format="float32",
         ... )
 
     """
@@ -117,25 +111,23 @@ def mdio_to_segy(  # noqa: C901
     creation_args = [
         mdio_path_or_buffer,
         output_segy_path,
-        endian,
         access_pattern,
-        out_sample_format,
+        endian,
         storage_options,
         new_chunks,
-        selection_mask,
         backend,
     ]
 
     if client is not None:
         if distributed is not None:
             # This is in case we work with big data
             feature = client.submit(mdio_spec_to_segy, *creation_args)
-            mdio, sample_format = feature.result()
+            mdio, segy_factory = feature.result()
         else:
             msg = "Distributed client was provided, but `distributed` is not installed"
             raise ImportError(msg)
     else:
-        mdio, sample_format = mdio_spec_to_segy(*creation_args)
+        mdio, segy_factory = mdio_spec_to_segy(*creation_args)
 
     live_mask = mdio.live_mask.compute()
 
@@ -163,10 +155,6 @@ def mdio_to_segy(  # noqa: C901
         selection_mask = selection_mask[dim_slices]
         live_mask = live_mask & selection_mask
 
-    # Parse output type and byte order
-    out_dtype = Dtype[out_sample_format.upper()]
-    out_byteorder = ByteOrder[endian.upper()]
-
     # tmp file root
     out_dir = path.dirname(output_segy_path)
     tmp_dir = TemporaryDirectory(dir=out_dir)
@@ -177,8 +165,7 @@ def mdio_to_segy(  # noqa: C901
                 samples=samples,
                 headers=headers,
                 live_mask=live_mask,
-                out_dtype=out_dtype,
-                out_byteorder=out_byteorder,
+                segy_factory=segy_factory,
                 file_root=tmp_dir.name,
                 axis=tuple(range(1, samples.ndim)),
             )
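
The `creation_args` hunk above reorders a list that is later unpacked positionally (`mdio_spec_to_segy(*creation_args)`), so the list order has to track the callee's signature exactly. A hedged sketch of the alternative, passing by keyword so that order no longer matters; `spec_to_segy_stub` is a stand-in with an assumed parameter order, since the real signature is not shown in this part of the diff:

```python
def spec_to_segy_stub(
    mdio_path_or_buffer,
    output_segy_path,
    access_pattern,
    endian,
    storage_options=None,
    new_chunks=None,
    backend="zarr",
):
    """Hypothetical stand-in that echoes the two arguments of interest."""
    return {"access_pattern": access_pattern, "endian": endian}


# Positional list in the old order: the two values land in the wrong slots.
old_order = ["in.mdio", "out.segy", "big", "012"]
print(spec_to_segy_stub(*old_order))

# Keyword mapping: each value is bound by name, whatever the dict order.
creation_kwargs = {
    "endian": "big",
    "access_pattern": "012",
    "output_segy_path": "out.segy",
    "mdio_path_or_buffer": "in.mdio",
}
print(spec_to_segy_stub(**creation_kwargs))
```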