Describe the bug, including details regarding any error messages, version, and platform.
I'm running arrow 16.1.0 with R 4.4.1 on a MacBook Pro M3, macOS 14.5, though this also occurred on Linux.
I partitioned my dataset by a string column that could be parsed to integer. When I read the dataset back into R with arrow::open_dataset, the partitioned column became an integer, even though I saved it as a string.
For example:
```r
mtcars |>
  dplyr::mutate(
    cyl_ch = stringr::str_pad(cyl, 2, pad = "0"),
    gear_ch = stringr::str_pad(gear, 2, pad = "0")
  ) |>
  dplyr::group_by(cyl_ch) |>
  arrow::write_dataset("output/partition_cyl_ch")
```
If I run `arrow::open_dataset("output/partition_cyl_ch")`, the `cyl_ch` column is now an `int32` (with values 4, 6, and 8 instead of "04", "06", and "08"), while the `gear_ch` column remains a `string` as intended.
I want the `cyl_ch` column to remain a `string` as well. There are no error messages, just an unexpected result.
It looks like the partitioned column itself (in this case, `cyl_ch`) isn't saved in the resulting Parquet files but is instead inferred from the folder name; perhaps this is where the string is cast to an integer. For instance, if I read `arrow::read_parquet("output/partition_cyl_ch/cyl_ch=04/part-0.parquet")` directly, the `cyl_ch` column does not appear at all. There was a similar issue in the duckdb GitHub repository, but I can't find anything in the arrow repo.
Component(s)
R
Hi @prayaggordy, you're right that when a dataset is written with partitioning, the partition fields aren't stored in the files.
Arrow's partitioning auto-detects the types of partition fields, as other systems do, but it also lets you supply a schema explicitly, which I think should get you what you want:
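The original snippet was lost in the thread, but something along these lines (an untested sketch using `hive_partition()` to pin the partition field's type) should work:

```r
library(arrow)

# Declare the partition field's type explicitly instead of letting
# arrow infer int32 from the "cyl_ch=04" directory names.
ds <- open_dataset(
  "output/partition_cyl_ch",
  partitioning = hive_partition(cyl_ch = string())
)

# cyl_ch now reads back as a string: "04", "06", "08"
ds$schema
```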
Hi @amoeba, thanks for the response. It would work in a small example, but I'm working with a large number of files. I can make do with setting the schema myself, but if there is any way to add stronger typing for the partition field I would be most grateful.
I think the issue here is that the schema is the most reliable way of controlling the type of any variable, and it's inevitable that there will be issues when a data type has to be inferred and isn't provided explicitly. The undesirable behaviour here might be desirable behaviour for other users.
I think the solution provided by @amoeba is pretty solid, and a solution for automating it might be to write a wrapper function which looks up the partitioning columns and then generates a schema from that to then pass into open_dataset().
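One possible shape for such a wrapper (a hypothetical, untested sketch; `open_dataset_string_partitions` is not an arrow function, and it assumes hive-style `col=value` directory names):

```r
library(arrow)

# Scan the hive-style directory names, then open the dataset with every
# partition column typed as string so nothing is inferred as integer.
open_dataset_string_partitions <- function(path) {
  # Collect all sub-directory path segments, e.g. "cyl_ch=04"
  dirs <- list.dirs(path, recursive = TRUE, full.names = FALSE)
  segments <- unlist(strsplit(dirs[dirs != ""], "/", fixed = TRUE))

  # Partition column names are the part before the "=" in each segment
  cols <- unique(sub("=.*$", "", segments[grepl("=", segments, fixed = TRUE)]))

  # Equivalent to hive_partition(cyl_ch = string(), ...) for each column
  part <- do.call(hive_partition,
                  setNames(rep(list(string()), length(cols)), cols))

  open_dataset(path, partitioning = part)
}

ds <- open_dataset_string_partitions("output/partition_cyl_ch")
```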
thisisnic changed the title from "Partitioned variable does not read in as the correct type" to "[R] Partitioned variable does not read in as the correct type" on Jul 27, 2024.