
Error handling from btrfs_reserve_extent can result in deadlocks with compression #22

Open
josefbacik opened this issue Mar 3, 2021 · 0 comments
Labels: bug (Something isn't working), kernel (Kernel related issue)

Generally speaking we should never fail to allocate space from btrfs_reserve_extent in cow_file_range; if we do, then we've messed up the ENOSPC accounting. However, we could fail to allocate memory elsewhere in this loop, and bugs do happen (that's how I noticed this problem).

The problem exists with compression. We hand off a range of locked pages to the async threads for compression. If they choose not to compress, we call cow_file_range with unlock == 0 for the entire range. Assume we are making a 128MiB allocation but the allocator can only satisfy 64MiB on the first loop iteration: we create the ordered extent and set up the pages for that first 64MiB, but do not unlock them. We then adjust start and try to allocate the remaining 64MiB. If that fails, we go to out_unlock and properly clear the rest of the range, i.e. 64MiB-128MiB, but the first range is still locked. We still need to call extent_write_locked_range for that first chunk, because we successfully allocated it and have pages waiting to be written out for it.
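For reference, the control flow in question looks roughly like the sketch below. This is a simplified paraphrase of the cow_file_range() allocation loop in fs/btrfs/inode.c, not the exact upstream code; add_ordered_extent_and_setup_pages() is a placeholder name for the ordered-extent/page setup, and the argument lists are abbreviated.

```c
/* Simplified paraphrase of the cow_file_range() loop, not the exact code. */
while (num_bytes > 0) {
	ret = btrfs_reserve_extent(root, num_bytes, num_bytes,
				   min_alloc_size, 0, alloc_hint, &ins, 1, 1);
	if (ret < 0)
		goto out_unlock;

	/*
	 * ins.offset may be smaller than num_bytes (e.g. 64MiB out of a
	 * 128MiB request).  We create the ordered extent and set up the
	 * pages for that sub-range, but with unlock == 0 they stay locked.
	 */
	add_ordered_extent_and_setup_pages(inode, start, &ins);

	num_bytes -= ins.offset;
	start += ins.offset;
	alloc_hint = ins.objectid + ins.offset;
}
return 0;

out_unlock:
	/*
	 * This only cleans up the part of the range we have NOT yet
	 * allocated (start..end).  The earlier sub-range still has locked
	 * pages and an ordered extent that nobody will ever submit IO for,
	 * so anything waiting on that ordered extent hangs.
	 */
	extent_clear_unlock_delalloc(inode, start, end, locked_page, ...);
	return ret;
```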

The solution is to somehow let the caller know that we successfully handled a sub-range of the range it asked for, so it can do the right thing. The tricky part is that in the normal buffered IO case we unconditionally mark the first page with PageError() if we get an error from cow_file_range(). In the case where we fail on some later sub-range this is actually wrong, and we can end up with the same problem: we never initiate writes on the initial page, and we are left with an ordered extent that will never complete because the IO was never issued for it.
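One possible shape for that, purely as a hypothetical sketch and not the current interface: give cow_file_range() an out-parameter that reports how far it actually got, so the caller can still issue writes for the successful sub-range and only error out the remainder. The done_offset parameter and the handle_error_on_remaining_range() helper below are made up for illustration.

```c
/*
 * Hypothetical sketch only: let cow_file_range() report how much of the
 * requested range it actually set up.  "done_offset" is not the current
 * upstream interface.
 */
static noinline int cow_file_range(struct btrfs_inode *inode,
				   struct page *locked_page,
				   u64 start, u64 end,
				   u64 *done_offset, int unlock)
{
	/* ... allocation loop as above, with start advancing per iteration ... */
	if (ret < 0) {
		/* last byte we fully set up before the failure */
		*done_offset = start - 1;
		goto out_unlock;
	}
	/* on full success: *done_offset = end; return 0; */
}

/* Caller side, e.g. the async compression fallback path: */
ret = cow_file_range(inode, locked_page, start, end, &done_offset, 0);
if (done_offset >= start) {
	/*
	 * IO still has to be issued for the sub-range that succeeded,
	 * otherwise its ordered extent never completes.
	 */
	extent_write_locked_range(&inode->vfs_inode, start, done_offset, ...);
}
if (ret)
	/* hypothetical helper: clear/unlock only the part that failed */
	handle_error_on_remaining_range(inode, done_offset + 1, end);
```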

This is kind of a weird thing to untangle; the best bet would be to enable error injection for cow_file_range, then start reproducing the hangs and fixing the problems that fall out.
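Assuming CONFIG_FUNCTION_ERROR_INJECTION and CONFIG_FAIL_FUNCTION are enabled, one way to get that error injection (a sketch, not a tested recipe) is to annotate cow_file_range() for the fail_function framework and then drive it from debugfs while running a compression-heavy workload:

```c
/*
 * fs/btrfs/inode.c -- allow fail_function to force an errno return from
 * cow_file_range().  The function must not be inlined so its symbol is
 * visible to the kprobe-based injection.
 */
#include <linux/error-injection.h>

ALLOW_ERROR_INJECTION(cow_file_range, ERRNO);

/*
 * Then from userspace, e.g. while running fsstress on a compress-force
 * mount, force occasional -ENOSPC returns (exact knobs depend on the
 * kernel config):
 *
 *   echo cow_file_range > /sys/kernel/debug/fail_function/inject
 *   printf '%#x' -28 > /sys/kernel/debug/fail_function/cow_file_range/retval
 *   echo 5 > /sys/kernel/debug/fail_function/probability
 *   echo 100 > /sys/kernel/debug/fail_function/interval
 */
```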
