GH-36990: [R] Expose Parquet ReaderProperties (apache#36992)
### Rationale for this change

Expose the ReaderProperties class in R so that the thrift size settings can be altered.

### What changes are included in this PR?

Add R6 class, link it up to the C++ class, use it when reading Parquet files.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Nope
* Closes: apache#36990

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
thisisnic authored Aug 14, 2023
1 parent 9cabd94 commit cd8830b
Showing 14 changed files with 311 additions and 20 deletions.
1 change: 1 addition & 0 deletions r/NAMESPACE
@@ -246,6 +246,7 @@ export(ParquetFileFormat)
 export(ParquetFileReader)
 export(ParquetFileWriter)
 export(ParquetFragmentScanOptions)
+export(ParquetReaderProperties)
 export(ParquetVersionType)
 export(ParquetWriterProperties)
 export(Partitioning)
28 changes: 24 additions & 4 deletions r/R/arrowExports.R

Generated file; diff not rendered.

16 changes: 14 additions & 2 deletions r/R/dataset-format.R
@@ -518,6 +518,13 @@ csv_file_format_read_opts <- function(schema = NULL, ...) {
 #' * `buffer_size`: Size of buffered stream, if enabled. Default is 8KB.
 #' * `pre_buffer`: Pre-buffer the raw Parquet data. This can improve performance
 #'   on high-latency filesystems. Disabled by default.
+#' * `thrift_string_size_limit`: Maximum string size allocated for decoding thrift
+#'   strings. May need to be increased in order to read
+#'   files with especially large headers. Default value
+#'   100000000.
+#' * `thrift_container_size_limit`: Maximum size of thrift containers. May need to be
+#'   increased in order to read files with especially large
+#'   headers. Default value 1000000.
 #
 #' `format = "text"`: see [CsvConvertOptions]. Note that options can only be
 #' specified with the Arrow C++ library naming. Also, "block_size" from
@@ -571,8 +578,13 @@ CsvFragmentScanOptions$create <- function(...,
 ParquetFragmentScanOptions <- R6Class("ParquetFragmentScanOptions", inherit = FragmentScanOptions)
 ParquetFragmentScanOptions$create <- function(use_buffered_stream = FALSE,
                                               buffer_size = 8196,
-                                              pre_buffer = TRUE) {
-  dataset___ParquetFragmentScanOptions__Make(use_buffered_stream, buffer_size, pre_buffer)
+                                              pre_buffer = TRUE,
+                                              thrift_string_size_limit = 100000000,
+                                              thrift_container_size_limit = 1000000) {
+  dataset___ParquetFragmentScanOptions__Make(
+    use_buffered_stream, buffer_size, pre_buffer, thrift_string_size_limit,
+    thrift_container_size_limit
+  )
 }

 #' @usage NULL
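A minimal sketch of how the two new `ParquetFragmentScanOptions$create()` arguments might be used when scanning a dataset. The temporary directory, the file name `part-0.parquet`, and the limit values are made up for illustration. Passing the options through the `fragment_scan_options` argument of `Scanner$create()` is an assumption about the rest of the package, since that code is not part of this diff.

```r
library(arrow)

# Hypothetical one-file dataset in a temporary directory.
dataset_dir <- tempfile()
dir.create(dataset_dir)
write_parquet(data.frame(x = 1:3), file.path(dataset_dir, "part-0.parquet"))

# Raise both thrift limits above the defaults shown in the diff
# (100000000 and 1000000). The values here are arbitrary.
opts <- ParquetFragmentScanOptions$create(
  thrift_string_size_limit = 200000000L,
  thrift_container_size_limit = 2000000L
)

ds <- open_dataset(dataset_dir, format = "parquet")

# Assumption: Scanner$create() accepts `fragment_scan_options`.
scanner <- Scanner$create(ds, fragment_scan_options = opts)
tab <- scanner$ToTable()
```

The other arguments (`use_buffered_stream`, `buffer_size`, `pre_buffer`) keep their defaults, so only the thrift limits differ from a plain scan.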
48 changes: 47 additions & 1 deletion r/R/parquet.R
@@ -457,6 +457,7 @@ ParquetFileWriter$create <- function(schema,
 #' (e.g. `RandomAccessFile`).
 #' - `props` Optional [ParquetArrowReaderProperties]
 #' - `mmap` Logical: whether to memory-map the file (default `TRUE`)
+#' - `reader_props` Optional [ParquetReaderProperties]
 #' - `...` Additional arguments, currently ignored
 #'
 #' @section Methods:
@@ -541,12 +542,13 @@ ParquetFileReader <- R6Class("ParquetFileReader",
 ParquetFileReader$create <- function(file,
                                      props = ParquetArrowReaderProperties$create(),
                                      mmap = TRUE,
+                                     reader_props = ParquetReaderProperties$create(),
                                      ...) {
   file <- make_readable_file(file, mmap)
   assert_is(props, "ParquetArrowReaderProperties")
   assert_is(file, "RandomAccessFile")

-  parquet___arrow___FileReader__OpenFile(file, props)
+  parquet___arrow___FileReader__OpenFile(file, props, reader_props)
 }

 #' @title ParquetArrowReaderProperties class
@@ -625,3 +627,47 @@ calculate_chunk_size <- function(rows, columns,

   chunk_size
 }
+
+#' @title ParquetReaderProperties class
+#' @rdname ParquetReaderProperties
+#' @name ParquetReaderProperties
+#' @docType class
+#' @usage NULL
+#' @format NULL
+#' @description This class holds settings to control how a Parquet file is read
+#' by [ParquetFileReader].
+#'
+#' @section Factory:
+#'
+#' The `ParquetReaderProperties$create()` factory method instantiates the object
+#' and takes no arguments.
+#'
+#' @section Methods:
+#'
+#' - `$thrift_string_size_limit()`
+#' - `$set_thrift_string_size_limit()`
+#' - `$thrift_container_size_limit()`
+#' - `$set_thrift_container_size_limit()`
+#'
+#' @export
+ParquetReaderProperties <- R6Class("ParquetReaderProperties",
+  inherit = ArrowObject,
+  public = list(
+    thrift_string_size_limit = function() {
+      parquet___arrow___ReaderProperties__get_thrift_string_size_limit(self)
+    },
+    set_thrift_string_size_limit = function(size) {
+      parquet___arrow___ReaderProperties__set_thrift_string_size_limit(self, size)
+    },
+    thrift_container_size_limit = function() {
+      parquet___arrow___ReaderProperties__get_thrift_container_size_limit(self)
+    },
+    set_thrift_container_size_limit = function(size) {
+      parquet___arrow___ReaderProperties__set_thrift_container_size_limit(self, size)
+    }
+  )
+)
+
+ParquetReaderProperties$create <- function() {
+  parquet___arrow___ReaderProperties__Make()
+}
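As a rough usage sketch of the class added above: create the properties object with its defaults, raise the limits through the setters, and hand it to `ParquetFileReader$create()` via the new `reader_props` argument. The temporary file and the limit values are illustrative only. `write_parquet()` and `$ReadTable()` are existing package functions that are not part of this diff.

```r
library(arrow)

# Hypothetical small file to read back.
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(x = 1:3), path)

# The factory takes no arguments; the limits are changed with the setters.
reader_props <- ParquetReaderProperties$create()
reader_props$set_thrift_string_size_limit(200000000L)
reader_props$set_thrift_container_size_limit(2000000L)

# The getters read the current values back from the C++ object.
string_limit <- reader_props$thrift_string_size_limit()
container_limit <- reader_props$thrift_container_size_limit()

# Pass the properties through the new `reader_props` argument.
reader <- ParquetFileReader$create(path, reader_props = reader_props)
tab <- reader$ReadTable()
```

Note that, as written in the diff, `ParquetFileReader$create()` checks `props` with `assert_is()` but does not check the class of `reader_props`.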
1 change: 1 addition & 0 deletions r/_pkgdown.yml
@@ -278,6 +278,7 @@ reference:
   - title: File read/writer interface
     contents:
       - ParquetFileReader
+      - ParquetReaderProperties
       - ParquetArrowReaderProperties
       - ParquetFileWriter
       - ParquetWriterProperties
7 changes: 7 additions & 0 deletions r/man/FragmentScanOptions.Rd

Generated file; diff not rendered.

1 change: 1 addition & 0 deletions r/man/ParquetFileReader.Rd

Generated file; diff not rendered.

27 changes: 27 additions & 0 deletions r/man/ParquetReaderProperties.Rd

Generated file; diff not rendered.
