Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43605: [Go][Parquet] Recover from panic in file reader #43607

Closed

Conversation

don4get
Copy link

@don4get don4get commented Aug 7, 2024

Rationale for this change

The Parquet file reader should handle panics within its internal goroutines to ensure graceful failure and prevent client-side crashes.

What changes are included in this PR?

ReadRowGroups now defers a panic recover from its goroutines and return an error to gracefully fail file reading.

Are these changes tested?

No more tests were added because I cannot reproduce such a corrupted file with dummy data.

Are there any user-facing changes?

No

This PR contains a "Critical Fix".

@don4get don4get requested a review from zeroshade as a code owner August 7, 2024 14:23
Copy link

github-actions bot commented Aug 7, 2024

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@don4get don4get changed the title [Go][Parquet] Recover from panic in file reader GH-43605: [Go][Parquet] Recover from panic in file reader Aug 7, 2024
Copy link

github-actions bot commented Aug 7, 2024

⚠️ GitHub issue #43605 has been automatically assigned in GitHub to PR creator.

@zeroshade
Copy link
Member

Do we know why it was causing the panic in the first place? Can you add the corrupted file to https://github.com/apache/arrow-testing and then use that for a test?

@don4get
Copy link
Author

don4get commented Aug 7, 2024

Do we know why it was causing the panic in the first place? Can you add the corrupted file to https://github.com/apache/arrow-testing and then use that for a test?

Sorry but I cannot share this corrupted file. I tried to reproduce a dummy file reproducing the bug but I did not find a way to corrupt the dummy file the same way as the original one because I do not know the corruption cause.

@mapleFU
Copy link
Member

mapleFU commented Aug 7, 2024

apache/parquet-testing#48 have some corrput parquet file which is able to use, but it might need a few days to check and merge that. Maybe it's more easier to makeup a test with go writer?

@don4get
Copy link
Author

don4get commented Aug 8, 2024

I managed to reproduce the dummy failing data.
It is produced with polars with this version:

[[package]]
name = "polars"
version = "0.20.13"
description = "Blazingly fast DataFrame library"
optional = false
python-versions = ">=3.8"
files = [
    {file = "polars-0.20.13-cp38-abi3-macosx_10_12_x86_64.whl", hash = "sha256:63bb00eb32b151a949666f8ae050bd09159bca572c0eab17a17abe6ac50fc8e0"},
    {file = "polars-0.20.13-cp38-abi3-macosx_11_0_arm64.whl", hash = "sha256:4ed694808252968dcda486dcd6009b8230bada83924d4f5d3359dbe3db6ab8e5"},
    {file = "polars-0.20.13-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1557e8947a263cefc1937cd047800678fbae8c9a475e6dada5b7dc6557180a4f"},
    {file = "polars-0.20.13-cp38-abi3-manylinux_2_24_aarch64.whl", hash = "sha256:8694d6fc307256e9e36b03975ccc89ce89290d6d661f75eb60e14e304d1e0968"},
    {file = "polars-0.20.13-cp38-abi3-win_amd64.whl", hash = "sha256:3917868d0a0331436a426f7acda24b2806e7f2458ee91f581d44765c9e87abe8"},
    {file = "polars-0.20.13.tar.gz", hash = "sha256:b3115c7499705d8f1a790add5806747a2eb3f19660d277e8e823199dcb66aeaf"},
]

It's a single column uint16 dataset, filled with 0 values.

The same dataset using polars with the following version does not reproduce the bug:

[[package]]
name = "polars"
version = "1.4.1"
description = "Blazingly fast DataFrame library"
optional = false
python-versions = ">=3.8"
files = [
    {file = "polars-1.4.1-cp38-abi3-macosx_10_12_x86_64.whl", hash = "sha256:f02fc6a5c63dd86cfeb159caa66112e477c69fc7800a28e64609ac2780554865"},
    {file = "polars-1.4.1-cp38-abi3-macosx_11_0_arm64.whl", hash = "sha256:bd2acd8b1977f61b9587c8d47d16f101e7e73edd8cdeb3a8a725f15f181cd120"},
    {file = "polars-1.4.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7cf834a328e292c31c06eb606496becb6d8a795e927c826e26e2af27087950f1"},
    {file = "polars-1.4.1-cp38-abi3-manylinux_2_24_aarch64.whl", hash = "sha256:64eabf0ef7ac0d17fe15361e7daaeb4425a875d2d760c17d96803e9ac8bee244"},
    {file = "polars-1.4.1-cp38-abi3-win_amd64.whl", hash = "sha256:2313d63ecfa1d9f1e740b9fcabb8ae45d9d0b5acf1ddb401951daba4c0f3f74f"},
    {file = "polars-1.4.1.tar.gz", hash = "sha256:ed8009aff8cf91f94db5a38d947185603ad5bee48a28b764cf5a52048c7c4756"},
]

To be able to execute this test, we now need to wait for this other PR to be merged:
apache/parquet-testing#57

@don4get
Copy link
Author

don4get commented Aug 8, 2024

New PR dealing with the cause of the panic is opened:
#43616

@mapleFU
Copy link
Member

mapleFU commented Aug 16, 2024

Does this is real bad-data? I found my parquet-c++ can read this out ?

@don4get
Copy link
Author

don4get commented Aug 19, 2024

Does this is real bad-data? I found my parquet-c++ can read this out ?

Maybe it should not be considered as bad data but arrow-go did not manage to open it before my PR.
Should I move it to "DATA" folder?

@mapleFU
Copy link
Member

mapleFU commented Aug 19, 2024

I'll try to figure out it later, you can try to read another "bad_data" file firstly

I believe it's easy to re-producing with other bad data file

@zeroshade
Copy link
Member

@don4get the parquet-testing PR was merged, can you update the submodule here and push it so it can run the CI with the new file?

@don4get
Copy link
Author

don4get commented Aug 23, 2024

@don4get the parquet-testing PR was merged, can you update the submodule here and push it so it can run the CI with the new file?

Hello, the PR needed for testing has been approved but not merged:
apache/parquet-testing#57

Can you please merge it? I will update this PR once it is merged.

@mapleFU
Copy link
Member

mapleFU commented Aug 23, 2024

I believe it's easy to re-producing with other bad data file

Would you mind read other bad-data?

@don4get
Copy link
Author

don4get commented Aug 23, 2024

I believe it's easy to re-producing with other bad data file

Would you mind read other bad-data?

ARROW-GH-41317 causes another panic which should be handled differently than it is in this PR:
More info in panic_while_reading_ARROW-GH-41317.txt

ARROW-GH-41321 causes yet another panic which should be handled like for #41317, in ReadTable.

Other bad data files have been added 6 years ago, I doubt they are relevant for the issue I am targeting in this PR.

@mapleFU
Copy link
Member

mapleFU commented Aug 23, 2024

Sigh, I would check ARROW-GH-43605.parquet for now...

I still don't know, since this pr is just recover from panic, #43616 is related to the avx2 decoding in ARROW-GH-43605.parquet. Can we using other file in this patch? Or this is strongly related to ARROW-GH-43605.parquet and cannot be reproduce with other file?

@don4get
Copy link
Author

don4get commented Aug 23, 2024

Sigh, I would check ARROW-GH-43605.parquet for now...

Concerning the panics caused by #41317 and #41321, these are not urgent matter because they are in the same thread than the ReadTable call, which means users can simply handle the panic on their side. I would say it is cleaner to simply return an error instead of panicking. Yet since they can handle it on their side, it is not urgent.

For the issue I am pointing in #43605, the problem is more severe because the panic happens in a goroutine, it is not handled by arrow and the user simply cannot handle it. User program will crash, no workaround possible.

@don4get
Copy link
Author

don4get commented Aug 23, 2024

Sigh, I would check ARROW-GH-43605.parquet for now...

I still don't know, since this pr is just recover from panic, #43616 is related to the avx2 decoding in ARROW-GH-43605.parquet. Can we using other file in this patch? Or this is strongly related to ARROW-GH-43605.parquet and cannot be reproduce with other file?

#43605 highlights a panic in a goroutine, which in the end is a crash in avx2 decoding.

#41317 leads to a panic in validate, far from avx decoding. Plus, validate panics are explicitly written by go-arrow authors, we could say it is a API contract, even though it's not documented and difficult to anticipate for clients because ReadTable already returns an error.

I think I cannot I cannot reuse these files, it is not similar issues.

@mapleFU
Copy link
Member

mapleFU commented Aug 23, 2024

I got this! Also, can you provide this link in polaris about how this issue is fixed in polaris? I cannot find that now

@mapleFU
Copy link
Member

mapleFU commented Aug 23, 2024

testing file is merged, sorrying for delaying to find the reason. just go-through for testing

@don4get
Copy link
Author

don4get commented Aug 26, 2024

I got this! Also, can you provide this link in polaris about how this issue is fixed in polaris? I cannot find that now

I don't know yet what fixed the issue in polars, I can only say for now that v1.4.1 is fixed, where as v0.20.13 caused the issue. I eventually investigate :)

testing file is merged, sorrying for delaying to find the reason. just go-through for testing

Thank you so much. I just updated the submodule both in the branch and the other one.

@don4get don4get force-pushed the recover-from-panic-in-file-reader branch from 5c291fc to ac12c20 Compare August 26, 2024 08:41
Comment on lines 394 to 396
if err != nil {
t.Errorf("unexpected error: %v", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require.NoError(t, err)

Comment on lines 407 to 409
if err != nil {
t.Errorf("unexpected error: %v", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require.NoError(t, err)

Comment on lines 413 to 415
if err == nil {
t.Errorf("expected error: %v", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert.Error(t, err)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 28, 2024
@zeroshade
Copy link
Member

Looks like the last call in the unit test is getting nil when it expected an error, is that a problem with the test or the fix?

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 29, 2024
Deferred function calls are executed in Last In First Out order after the surrounding function returns:
https://go.dev/blog/defer-panic-and-recover
@don4get
Copy link
Author

don4get commented Aug 29, 2024

Looks like the last call in the unit test is getting nil when it expected an error, is that a problem with the test or the fix?

I manage to reproduce this locally with ./ci/scripts/go_test.sh. However, when I call the test directly, either from IDE or command line, the test is OK:

PARQUET_TEST_BAD_DATA=/home/xxx/work/arrow-1/cpp/submodules/parquet-testing/bad_data go test -asan -timeout 30s -run ^TestReadParquetFile$ github.com/apache/arrow/go/v18/parquet/pqarrow

I initially thought it was linked to -asan flag, but it's not like this. Do you have an idea what is the difference between the ci script.

@don4get
Copy link
Author

don4get commented Aug 29, 2024

The test fails when this flag is set: -tags noasm. I don't know what it changes to the code base.
I can suggest to modify the test and remove the condition on the last error.
The main point of this PR is to prevent a panic coming from another goroutine.

@zeroshade
Copy link
Member

What is the error that you get when running it locally without the noasm flag? Did you expect there to be an error?

That particular flag controls whether or not optimized assembly code is utilized for decoding values, when that flag is set the library will fall back to pure go implementations of several decoding aspects such as unpacking booleans or packed integers. Without that flag, SIMD (vectorized) implementations of those functions are used instead on a per-architecture basis (falling back to pure go if we don't have optimized assembly for a given platform).

@don4get
Copy link
Author

don4get commented Sep 12, 2024

What is the error that you get when running it locally without the noasm flag? Did you expect there to be an error?

That particular flag controls whether or not optimized assembly code is utilized for decoding values, when that flag is set the library will fall back to pure go implementations of several decoding aspects such as unpacking booleans or packed integers. Without that flag, SIMD (vectorized) implementations of those functions are used instead on a per-architecture basis (falling back to pure go if we don't have optimized assembly for a given platform).

    file_reader_test.go:409: 
        	Error Trace:	/arrow/go/parquet/pqarrow/file_reader_test.go:409
        	Error:      	An error is expected but got nil.
	_, err = arrowRdr.ReadTable(ctx)

	assert.Error(t, err)

https://github.com/don4get/arrow/blob/589c3e4df00ca906c65830fa4e90ca182f424276/go/parquet/pqarrow/file_reader_test.go#L409

While the default implementation fails to read the table (it returns an error), the noasm version does not fail (does not return an error).

Should I adapt the expect error if the flag noasm is present or not?

@zeroshade
Copy link
Member

@don4get I figured out the issue and uploaded a fix so that the tests should all work properly now.

@zeroshade
Copy link
Member

As the Go implementation has shifted to the github.com/apache/arrow-go repository, once all the CI here passes, I'm going to shift this to the new repo and merge it there instead as we are no longer merging Go PRs on this repo. (We've only held off on shifting ALL the Go PRs while we get the CI fully complete over there).

@don4get
Copy link
Author

don4get commented Sep 12, 2024

@don4get I figured out the issue and uploaded a fix so that the tests should all work properly now.

Thanks @zeroshade :)

@zeroshade
Copy link
Member

Closing this to move it to apache/arrow-go#124 for the new Arrow Go repo.

@zeroshade zeroshade closed this Sep 13, 2024
zeroshade added a commit to apache/arrow-go that referenced this pull request Sep 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants