Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43349: [R] Fix altrep string columns from readr #43351

Merged
merged 6 commits into from
Jul 27, 2024

Conversation

jonkeane
Copy link
Member

@jonkeane jonkeane commented Jul 21, 2024

Rationale for this change

To resolve the reverse dependency issue with parquetize

What changes are included in this PR?

One step towards resolving the issue

Are these changes tested?

yes

Are there any user-facing changes?

no

@jonkeane jonkeane requested a review from thisisnic as a code owner July 21, 2024 02:46
Copy link

⚠️ GitHub issue #43349 has been automatically assigned in GitHub to PR creator.

@@ -148,7 +148,7 @@ inline SEXP utf8_strings(SEXP x) {

for (R_xlen_t i = 0; i < n; i++, ++p_x) {
SEXP s = *p_x;
if (s != NA_STRING) {
if (s != NA_STRING && ALTREP(s)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not actually work yet. With this

test_that("We handle non-UTF strings", {
x <- iconv("Veitingastaðir", to = "latin1")
df <- tibble::tibble(
chr = x,
fct = as.factor(x)
)
names(df) <- iconv(paste(x, names(df), sep = "_"), to = "latin1")
df_struct <- tibble::tibble(a = df)
raw_schema <- list(utf8(), dictionary(int8(), utf8()))
names(raw_schema) <- names(df)
# Confirm setup
expect_identical(Encoding(x), "latin1")
expect_identical(Encoding(names(df)), c("latin1", "latin1"))
expect_identical(Encoding(df[[1]]), "latin1")
expect_identical(Encoding(levels(df[[2]])), "latin1")
# Array
expect_identical(as.vector(Array$create(x)), x)
fail.

Before #43173 this line was:

if (s != NA_STRING && !IS_UTF8(s) && !IS_ASCII(s)) {

I suspect what's going on is that we caught vroom altrep vectors with this if and called SET_STRING_ELT so we didn't get the No Set_elt found for ALTSTRING error. But which ALTREP(s) we detect the non-utf strings here too and attempt to SET_STRING_ELT when we shouldn't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts @nealrichardson ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't understand what's happening in here. L144 says that this should no longer be altrep by the time we get here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that STRING_PTR_RO() should materialize an ALTREP character vector, otherwise you can't iterate over the CHARSXP pointers.

Also, ALTREP(s) does not seem to make sense to me, that's a CHARSXP (not a STRSXP), so it cannot be ALTREP, unless I am misremembering something.

Are you trying to catch the non-UTF-8 strings here? Rf_translateCharUTF8() does not do anything (returns the same const char *) if the string is UTF-8, so you could always call it, and only call SET_STRING_ELT() if the returned pointer is different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that STRING_PTR_RO() should materialize an ALTREP character vector, otherwise you can't iterate over the CHARSXP pointers.

Hmm, so maybe the actual problem here is that that isn't working on the ALTREP vectors coming from vroom? Here is the error we started seeing when we changed this condition from s != NA_STRING && !IS_UTF8(s) && !IS_ASCII(s) to s != NA_STRING:

Error in Table__from_dots(dots, schema, option_use_threads()): No Set_elt found for ALTSTRING class [class: vroom_chr, pkg: vroom]

Also, ALTREP(s) does not seem to make sense to me, that's a CHARSXP (not a STRSXP), so it cannot be ALTREP, unless I am misremembering something.

Yeah, I agree that ALTREP(s) here is not right.

Are you trying to catch the non-UTF-8 strings here? Rf_translateCharUTF8() does not do anything (returns the same const char *) if the string is UTF-8, so you could always call it,

I will admit I'm not 100% certain what is going on here, but I catching non-UTF-8 strings here is what I thought it was doing.

and only call SET_STRING_ELT() if the returned pointer is different?

Oh interesting. Forgive my C++ naïveness, but would this be something like (i.e. does != check if the pointer is different>):

SEXP new_s = Rf_mkCharCE(Rf_translateCharUTF8(s);
if (new_s != s) {
    SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(new_s), CE_UTF8));
}

Copy link

@gaborcsardi gaborcsardi Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that C++ looks correct, except you don't need to translate twice:

SEXP new_s = Rf_mkCharCE(Rf_translateCharUTF8(s);
if (new_s != s) {
    SET_STRING_ELT(x, i, Rf_mkCharCE(new_s));
}

However, I think the issue is that if x is ALTREP, then it does not matter if it is materialized of not, it is still ALTREP, so you still cannot change it with SET_STRING_ELT.

To avoid this, you'd need to call Rf_duplicate() on x. That should create a non-altrep copy of it.

You can always call Rf_duplicate(), or you can call it when you encounter a non-utf8 string. Note that you need to PROTECT() the result of Rf_duplicate()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaah thanks for those pointers! That looked like it worked, I'm going to trigger a benchmark run to see if there are any unintended consequences there (Hopefully no, but even if so, we might need to accept them!)

r/DESCRIPTION Outdated
@@ -62,6 +62,7 @@ Suggests:
lubridate,
pillar,
pkgload,
readr,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the test is worth adding to suggests, though we will get rid of that annoying xref error if we do.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 21, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 23, 2024
@jonkeane
Copy link
Member Author

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Jul 23, 2024

Benchmark runs are scheduled for commit 2a0da1e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jul 23, 2024
// ensure that x is not actually altrep first
if (ALTREP(x)) {
x = PROTECT(Rf_duplicate(x));
UNPROTECT(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this isn't right, doesn't the UNPROTECT need to go after you're done using the thing you PROTECTed?

I'm also not sure how the C-level PROTECT stuff interacts with unwind_protect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I'm surprised I don't get any stack imbalance issues with it. I do get them if I don't have UNPROTECT or I call it unconditionally. But anyway I have a slightly different version incoming.

Copy link

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 2a0da1e.

There were 38 benchmark results with an error:

There were 11 benchmark results indicating a performance regression:

The full Conbench report has more details.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 23, 2024
@jonkeane
Copy link
Member Author

@github-actions crossbow submit -g r

Copy link

Revision: 206e94d

Submitted crossbow builds: ursacomputing/crossbow @ actions-6de7c6d005

Task Status
r-binary-packages GitHub Actions
test-r-arrow-backwards-compatibility GitHub Actions
test-r-clang-sanitizer GitHub Actions
test-r-depsource-bundled Azure
test-r-depsource-system GitHub Actions
test-r-dev-duckdb GitHub Actions
test-r-devdocs GitHub Actions
test-r-gcc-11 GitHub Actions
test-r-gcc-12 GitHub Actions
test-r-install-local GitHub Actions
test-r-install-local-minsizerel GitHub Actions
test-r-linux-as-cran GitHub Actions
test-r-linux-rchk GitHub Actions
test-r-linux-valgrind GitHub Actions
test-r-minimal-build Azure
test-r-offline-maximal GitHub Actions
test-r-offline-minimal Azure
test-r-rhub-debian-gcc-devel-lto-latest Azure
test-r-rhub-debian-gcc-release-custom-ccache Azure
test-r-rhub-ubuntu-release-latest Azure
test-r-rocker-r-ver-latest Azure
test-r-rstudio-r-base-4.1-opensuse155 Azure
test-r-rstudio-r-base-4.2-focal Azure
test-r-ubuntu-22.04 GitHub Actions
test-r-versions GitHub Actions
test-ubuntu-r-sanitizer GitHub Actions

// ensure that x is not actually altrep first
bool was_altrep = ALTREP(x);
if (was_altrep) {
x = PROTECT(Rf_duplicate(x));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment about why we have to duplicate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll expand what I have there

@@ -152,6 +157,9 @@ inline SEXP utf8_strings(SEXP x) {
SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we want to check whether Rf_translateCharUTF8() actually modified anything? Or do we trust that SET_STRING_ELT is a no-op in that case? I would imagine that in most cases, we already have ascii/utf-8 strings, so this whole function should be basically free. That should be easily verified by microbenchmarking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this from above?

SEXP new_s = Rf_translateCharUTF8(s);
if (new_s == s) {
SET_STRING_ELT(x, i, Rf_mkCharCE(new_s, CE_UTF8));
}

Yeah, that's probably good. It'll also make this slightly more inline with what was there before (checking that not utf8)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried this, but after getting it to work with a bit of type faffing, doing the translate first we get errors with our utf string tests again.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jul 24, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jul 25, 2024
@github-actions github-actions bot added the awaiting changes Awaiting changes label Jul 25, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 27, 2024
@jonkeane jonkeane merged commit 187197c into apache:main Jul 27, 2024
10 checks passed
@jonkeane jonkeane removed the awaiting change review Awaiting change review label Jul 27, 2024
jonkeane added a commit that referenced this pull request Jul 27, 2024
### Rationale for this change

To resolve the reverse dependency issue with `parquetize`

### What changes are included in this PR?

One step towards resolving the issue

### Are these changes tested?

yes

### Are there any user-facing changes?

no
* GitHub Issue: #43349

Authored-by: Jonathan Keane <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 187197c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 28 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants