# [Python] Reading Hive-style partitioned parquet files from GCS #30481
## Comments
Joris Van den Bossche / @jorisvandenbossche:

```python
import pyarrow.parquet as pq
pq.read_table("path/to/partitioned/dataset/base/dir/partition_var=some_value/<data file>.parquet", filesystem=gcs)
```

Can you try some things with the filesystem object and show what those return?

```python
gcs.isdir("path/to/partitioned/dataset/base/dir/")
gcs.exists("path/to/partitioned/dataset/base/dir/")
gcs.info("path/to/partitioned/dataset/base/dir/")
gcs.find("path/to/partitioned/dataset/base/dir/", maxdepth=None, withdirs=True, detail=True)
```
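For reference, a self-contained sketch of those diagnostics, assuming gcsfs is installed; the path is a placeholder to be replaced with the real bucket prefix:

```python
import gcsfs

# Placeholder path; substitute the real bucket and dataset prefix.
base_dir = "path/to/partitioned/dataset/base/dir/"

gcs = gcsfs.GCSFileSystem()

# fsspec filesystems expose these introspection methods; printing them
# shows how the object store reports the "directory" entries that
# pyarrow's dataset discovery relies on.
print(gcs.isdir(base_dir))
print(gcs.exists(base_dir))
print(gcs.info(base_dir))
print(gcs.find(base_dir, maxdepth=None, withdirs=True, detail=True))
```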
@jorisvandenbossche We experience this same bug with PyArrow v11. Tested that the same partitioned directory works with …

Code example:

```python
import gcsfs
import pyarrow as pa
import pyarrow.parquet as pq  # needed for pq.ParquetDataset below

gcs = gcsfs.GCSFileSystem()
parquet_ds = pq.ParquetDataset("gs://<redacted-bucket-name>/partition/dir", filesystem=gcs)
```

Traceback:

```
<venv>/lib/python3.10/site-packages/pyarrow/parquet/core.py:1763: in __new__
    return _ParquetDatasetV2(
<venv>/lib/python3.10/site-packages/pyarrow/parquet/core.py:2477: in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
<venv>/lib/python3.10/site-packages/pyarrow/dataset.py:762: in dataset
    return _filesystem_dataset(source, **kwargs)
<venv>/lib/python3.10/site-packages/pyarrow/dataset.py:453: in _filesystem_dataset
    factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
pyarrow/_dataset.pyx:2236: in pyarrow._dataset.FileSystemDatasetFactory.__init__
    ???
pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path '<redacted-bucket-name>/partition/dir/part-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet', which is outside base dir 'gs://<redacted-bucket-name>/partition/dir'
pyarrow/error.pxi:100: ArrowInvalid
```

It appears to be simply caused by removal of the …
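The error message above suggests a mismatch: the listing yields bucket-relative paths without the `gs://` scheme, while the base dir retains it. A commonly reported workaround, sketched here as an assumption rather than a confirmed fix, is to drop the scheme from the path when passing an fsspec filesystem explicitly:

```python
import gcsfs
import pyarrow.parquet as pq

gcs = gcsfs.GCSFileSystem()

# gcsfs reports object names without the "gs://" scheme, so passing the
# bucket-relative path keeps the base dir consistent with the paths
# GetFileInfo() yields. The bucket name here is a placeholder.
parquet_ds = pq.ParquetDataset("<redacted-bucket-name>/partition/dir", filesystem=gcs)
```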
We are also facing a similar issue. We have a Hive-style partitioned parquet dataset written with Spark, and we cannot load it with pyarrow (using gcsfs as the filesystem); we get a FileNotFoundError when we run the read. We can also confirm that the files do exist in GCS and that we can load individual files directly.
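A minimal sketch of the kind of call being described, assuming a gcsfs filesystem and a hypothetical hive-partitioned layout (the commenter's actual snippet and error output were not preserved):

```python
import gcsfs
import pyarrow.parquet as pq

gcs = gcsfs.GCSFileSystem()

# Hypothetical layout: gs://my-bucket/dataset/date=2021-01-01/part-*.parquet
# Reading from the base directory triggers hive-partition discovery,
# which is where the FileNotFoundError is reported.
table = pq.read_table("my-bucket/dataset", filesystem=gcs)

# Reading a single data file directly works, per the comment above.
single = pq.read_table(
    "my-bucket/dataset/date=2021-01-01/part-00000.snappy.parquet",
    filesystem=gcs,
)
```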
Having the same issue in Azure ML Studio, reading a Spark DataFrame written with a single partition (.../foo=1/).
@felipecrv (tagging myself so I get this on my "Participating" inbox filter)
We're also running into this. More absurdly, it fails on the first read, but a retry of the exact same call works every time, which is bizarre. It also only happens when we …
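To illustrate the reported retry behavior, a minimal sketch; the path and filesystem are placeholders mirroring the description above, not the commenter's actual code:

```python
import gcsfs
import pyarrow.parquet as pq

gcs = gcsfs.GCSFileSystem()
path = "my-bucket/dataset"  # placeholder

try:
    # First attempt reportedly raises FileNotFoundError...
    table = pq.read_table(path, filesystem=gcs)
except FileNotFoundError:
    # ...while an identical retry succeeds, suggesting a stale or
    # lazily populated directory-listing cache in the filesystem layer.
    table = pq.read_table(path, filesystem=gcs)
```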
Bumping this.
Trying to read a spark-generated hive-style partitioned parquet dataset with gcsfs and pyarrow, but getting a FileNotFoundError when I read from the base directory, or even when I read directly from one of the partitions. Not sure if I am doing something wrong or if this is not supported.
Note that I have successfully read this hive-style partitioned parquet dataset using other methods, to rule out other issues:

- Successful read with pyspark using spark.read.parquet
- Successful read of a specific partition by passing a list of paths to ParquetDataset
- Successful read of another spark-generated parquet dataset (with no Hive-style partitions) from GCS
Below is what I am trying:
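The original snippet was lost in the migration; a sketch of the two reads described above, with placeholder paths, might look like:

```python
import gcsfs
import pyarrow.parquet as pq

gcs = gcsfs.GCSFileSystem()

# Read from the base directory (relies on hive-partition discovery)...
base = pq.ParquetDataset("my-bucket/dataset", filesystem=gcs)

# ...and read directly from a single partition directory.
part = pq.ParquetDataset("my-bucket/dataset/partition_var=some_value", filesystem=gcs)
```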
The errors returned for both are below: …
Reporter: Garrett Weaver
Note: This issue was originally created as ARROW-14959. Please see the migration documentation for further details.