
[R] Partitioned variable does not read in as the correct type #43303

Closed
prayaggordy opened this issue Jul 17, 2024 · 3 comments

@prayaggordy

Describe the bug, including details regarding any error messages, version, and platform.

I'm running arrow 16.1.0 with R 4.4.1 on a MacBook Pro M3, macOS 14.5, though this also occurred on Linux.

I partitioned my dataset by a string column whose values could be parsed as integers. When I read the dataset back into R with arrow::open_dataset, the partitioning column came back as an integer, even though I saved it as a string.

For example:

mtcars |>
  dplyr::mutate(cyl_ch = stringr::str_pad(cyl, 2, pad = "0"),
                gear_ch = stringr::str_pad(gear, 2, pad = "0")) |>
  dplyr::group_by(cyl_ch) |>
  arrow::write_dataset("output/partition_cyl_ch")

The resulting directory structure is:

output
  partition_cyl_ch
    cyl_ch=04
      part-0.parquet
    cyl_ch=06
      part-0.parquet
    cyl_ch=08
      part-0.parquet

If I run arrow::open_dataset("output/partition_cyl_ch"), the cyl_ch column is now an int32 (with values 4, 6, and 8 instead of "04", "06", and "08"), while the gear_ch column remains a string as intended.

I want the cyl_ch column to remain a string as well. There are no error messages, just an unexpected result.

It looks like the partitioning column itself (in this case, cyl_ch) isn't saved in the resulting parquet files but is instead inferred from the folder names; perhaps this is where the string gets cast to an integer. For instance, if I read a file directly with arrow::read_parquet("output/partition_cyl_ch/cyl_ch=04/part-0.parquet"), the cyl_ch column does not appear. There was a similar issue in the duckdb GitHub repository, but I can't find anything in the arrow repo.

Component(s)

R

@amoeba
Member

amoeba commented Jul 18, 2024

Hi @prayaggordy, you're right that when a dataset is written with partitioning, the partition fields aren't stored in the files.

Arrow's partitioning approach does auto-detection like other systems do, but it also lets you provide a schema explicitly, which I think should get you what you want:

> my_schema <- schema(field("cyl_ch", string()))
> open_dataset("output/partition_cyl_ch", partitioning = my_schema)
FileSystemDataset with 3 Parquet files
13 columns
mpg: double
cyl: double
disp: double
hp: double
drat: double
wt: double
qsec: double
vs: double
am: double
gear: double
carb: double
gear_ch: string
cyl_ch: string

Would this work for your use case?

@prayaggordy
Author

Hi @amoeba, thanks for the response. That would work in a small example, but I'm working with a large number of files. I can make do with setting the schema myself, but if there is any way to get stronger typing for partition fields I would be most grateful.

@thisisnic
Member

I think the issue here is that a schema is the most reliable way of controlling the type of any variable, and there will inevitably be cases like this when a data type has to be inferred rather than provided explicitly. The behaviour that's undesirable here may well be desirable for other users.

I think the solution provided by @amoeba is pretty solid, and one way to automate it would be to write a wrapper function that looks up the partitioning columns and generates a schema from them to pass into open_dataset().
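A minimal sketch of such a wrapper, assuming Hive-style key=value directory names as produced by arrow::write_dataset (the helper name open_dataset_as_strings is hypothetical, not part of arrow):

```r
# Hypothetical helper: discover Hive-style partition keys from the
# directory names and declare them all as strings before opening.
open_dataset_as_strings <- function(path) {
  # List subdirectories like "cyl_ch=04" and pull out the key names
  dirs <- list.dirs(path, recursive = TRUE, full.names = FALSE)
  parts <- grep("=", basename(dirs), value = TRUE)
  keys <- unique(sub("=.*$", "", parts))

  # Build a schema declaring every partition key as a string
  fields <- lapply(keys, function(k) arrow::field(k, arrow::string()))
  part_schema <- do.call(arrow::schema, fields)

  arrow::open_dataset(path, partitioning = part_schema)
}

# e.g. open_dataset_as_strings("output/partition_cyl_ch")
```

This only forces partition keys to strings; if some partitions should stay numeric, the keys would need to be mapped to types individually instead.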

@thisisnic thisisnic changed the title Partitioned variable does not read in as the correct type [R] Partitioned variable does not read in as the correct type Jul 27, 2024