[Python] Read GCS parquet file failed with pyarrow #44438

Open
yuqi1129 opened this issue Oct 16, 2024 · 0 comments
yuqi1129 commented Oct 16, 2024

Describe the bug, including details regarding any error messages, version, and platform.

requirements.txt:

requests==2.32.2
dataclasses-json==0.6.6
readerwriterlock==1.0.9
fsspec==2024.9.0
pyarrow==16.1.0
cachetools==5.3.3
google-auth==2.35.0

Reproduction script:

from pyarrow.fs import GcsFileSystem
from fsspec.implementations.arrow import ArrowFSWrapper
import os
import pandas
import pyarrow.dataset as dt

fileset_storage_location = "gs://xxxx/catalog/schema/fileset3"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "xxxxx.json"

# Wrap the native pyarrow GCS filesystem in an fsspec-compatible interface.
selffs = ArrowFSWrapper(GcsFileSystem())

# Write a small DataFrame to GCS as Parquet.
data = pandas.DataFrame({"Name": ["A", "B", "C", "D"], "ID": [20, 21, 19, 18]})
parquet_file = fileset_storage_location + "/test.parquet"
data.to_parquet(parquet_file, filesystem=selffs)

# Reading the same file back as a dataset raises the error below.
arrow_dataset = dt.dataset(parquet_file, filesystem=selffs)

Running this script fails with the following error:

Traceback (most recent call last):
File "", line 1, in
File "/home/ec2-user/gravitino/clients/client-python/venv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 794, in dataset
return _filesystem_dataset(source, **kwargs)
File "/home/ec2-user/gravitino/clients/client-python/venv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
return factory.finish(schema)
File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
File "pyarrow/io.pxi", line 341, in pyarrow.lib.NativeFile.seek
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: google::cloud::Status(OUT_OF_RANGE: Permanent error, with a last message of Request range not satisfiable error_info={reason=, domain=, metadata={gcloud-cpp.retry.function=ReadObjectNotWrapped, gcloud-cpp.retry.reason=permanent-error, gcloud-cpp.retry.original-message=Request range not satisfiable}})

If we switch the fsspec and pyarrow versions to:

fsspec==2024.3.1
pyarrow==15.0.2

then the error message becomes:

Traceback (most recent call last):
File "", line 1, in
File "/home/ec2-user/gravitino/clients/client-python/venv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 782, in dataset
return _filesystem_dataset(source, **kwargs)
File "/home/ec2-user/gravitino/clients/client-python/venv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 475, in _filesystem_dataset
return factory.finish(schema)
File "pyarrow/_dataset.pyx", line 3025, in pyarrow._dataset.DatasetFactory.finish
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
File "pyarrow/io.pxi", line 328, in pyarrow.lib.NativeFile.seek
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: google::cloud::Status(OUT_OF_RANGE: Permanent error ReadObjectNotWrapped: Request range not satisfiable)
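For context, both tracebacks end in NativeFile.seek with "Request range not satisfiable". One possible reading (an assumption, not a confirmed diagnosis) is that the read starts at an offset at or beyond the end of the object, for example because the object is empty or smaller than expected. Under the usual HTTP byte-range rule, such a request cannot be served. The sketch below only illustrates that rule; the function name and the sizes are hypothetical and not part of pyarrow or the GCS client.

```python
def range_is_satisfiable(first_byte: int, object_size: int) -> bool:
    """Return True if a 'bytes=first_byte-' style request can be served.

    Mirrors the HTTP byte-range rule: the first requested byte must lie
    inside the object. This is an illustration, not pyarrow or GCS code.
    """
    if first_byte < 0 or object_size < 0:
        raise ValueError("offset and size must be non-negative")
    return first_byte < object_size


# Hypothetical sizes, for illustration only.
print(range_is_satisfiable(0, 1024))     # True: offset inside the object
print(range_is_satisfiable(1024, 1024))  # False: offset at end of object
print(range_is_satisfiable(0, 0))        # False: empty object
```

If this reading applies, checking the size of the written object in the bucket would be a reasonable first step when narrowing the problem down.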

OS & Python

(venv) [ec2-user@ip-111- client-python]$ python --version
Python 3.9.16
(venv) [ec2-user@ip-111-client-python]$ uname -a
Linux ip-xxxxx.ap-northeast-1.compute.internal 6.1.102-111.182.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Aug 13 22:23:09 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
(venv) [ec2-user@ip-172-31-10-123 client-python
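Since the error appears under two different fsspec/pyarrow pin sets, it may help to confirm which versions are actually installed in the venv. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical, and pandas is included in the list because the report uses it without stating its version.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages taken from the report; adjust the list as needed.
    packages = ["pyarrow", "fsspec", "pandas", "google-auth"]
    for name, version in installed_versions(packages).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running it inside the same venv as the reproduction script prints one line per package, which can be pasted alongside the pins above.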

Component(s)

Python
