
[R] Partitioned variable does not read in as the correct type #43303

Closed
prayaggordy opened this issue Jul 17, 2024 · 3 comments

@prayaggordy

Describe the bug, including details regarding any error messages, version, and platform.

I'm running arrow 16.1.0 with R 4.4.1 on a MacBook Pro M3, macOS 14.5, though this also occurred on Linux.

I partitioned my dataset by a string column whose values could be parsed as integers. When I read the dataset back into R with arrow::open_dataset, the partitioning column came back as an integer, even though I saved it as a string.

For example:

mtcars |>
  dplyr::mutate(cyl_ch = stringr::str_pad(cyl, 2, pad = "0"),
                gear_ch = stringr::str_pad(gear, 2, pad = "0")) |>
  dplyr::group_by(cyl_ch) |>
  arrow::write_dataset("output/partition_cyl_ch")

The resulting directory structure is:

output
  partition_cyl_ch
    cyl_ch=04
      part-0.parquet
    cyl_ch=06
      part-0.parquet
    cyl_ch=08
      part-0.parquet

If I run arrow::open_dataset("output/partition_cyl_ch"), the cyl_ch column is now an int32 (with values 4, 6, and 8 instead of "04", "06", and "08"), while the gear_ch column remains a string as intended.

I want the cyl_ch column to remain a string as well. There are no error messages, just an unexpected result.

It looks like the partitioning column itself (in this case, cyl_ch) isn't saved in the resulting parquet files but is instead inferred from the folder names; perhaps this is where the string gets cast to an integer. For instance, if I read a file directly with arrow::read_parquet("output/partition_cyl_ch/cyl_ch=04/part-0.parquet"), the cyl_ch column does not appear. There was a similar issue in the duckdb GitHub repository, but I can't find anything in the arrow repo.

Component(s)

R

@amoeba
Member

amoeba commented Jul 18, 2024

Hi @prayaggordy, you're right that when a dataset is written with partitioning, the partition fields aren't stored in the files.

Arrow's partitioning approach does auto-detection like other systems do, but it also lets you provide a schema explicitly, which I think should get you what you want:

> my_schema <- schema(field("cyl_ch", string()))
> open_dataset("output/partition_cyl_ch", partitioning = my_schema)
FileSystemDataset with 3 Parquet files
13 columns
mpg: double
cyl: double
disp: double
hp: double
drat: double
wt: double
qsec: double
vs: double
am: double
gear: double
carb: double
gear_ch: string
cyl_ch: string

Would this work for your use case?

@prayaggordy
Author

Hi @amoeba, thanks for the response. That would work in a small example, but I'm working with a large number of files. I can make do with setting the schema myself, but if there is any way to get stronger typing for partition fields I would be most grateful.

@thisisnic
Member

I think the issue here is that a schema is the most reliable way of controlling the type of any variable, and there will inevitably be cases like this when a data type has to be inferred rather than provided explicitly. The behaviour that's undesirable here may well be desirable for other users.

I think the solution provided by @amoeba is pretty solid, and one way to automate it would be to write a wrapper function that looks up the partitioning columns and generates a schema from them to pass into open_dataset().
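A minimal sketch of such a wrapper, assuming Hive-style key=value directory names as produced by arrow::write_dataset (the helper name open_dataset_as_strings is hypothetical, not part of arrow):

```r
# Hypothetical helper: discover Hive-style partition keys from the
# directory names and declare them all as strings before opening.
open_dataset_as_strings <- function(path) {
  # List subdirectories like "cyl_ch=04" and pull out the key names
  dirs <- list.dirs(path, recursive = TRUE, full.names = FALSE)
  parts <- grep("=", basename(dirs), value = TRUE)
  keys <- unique(sub("=.*$", "", parts))

  # Build a schema declaring every partition key as a string
  fields <- lapply(keys, function(k) arrow::field(k, arrow::string()))
  part_schema <- do.call(arrow::schema, fields)

  arrow::open_dataset(path, partitioning = part_schema)
}

# e.g. open_dataset_as_strings("output/partition_cyl_ch")
```

This only forces partition keys to strings; if some partitions should stay numeric, the keys would need to be mapped to types individually instead.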

@thisisnic thisisnic changed the title Partitioned variable does not read in as the correct type [R] Partitioned variable does not read in as the correct type Jul 27, 2024