Issue with parallel uploads to the same blob #462

Open
M0dEx opened this issue Mar 4, 2024 · 1 comment
M0dEx commented Mar 4, 2024

There seems to be an issue when two instances of this file system write to the same blob in parallel from two different processes: one of the uploads fails with:

Azure error
    File "/code/.venv/lib/python3.10/site-packages/our_package/connector/storage/blob.py", line 117, in _save
        with self._fs.open(
    File "/code/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 1963, in __exit__
        self.close()
    File "/code/.venv/lib/python3.10/site-packages/adlfs/spec.py", line 1908, in close
        super().close()
    File "/code/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 1930, in close
        self.flush(force=True)
    File "/code/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 1801, in flush
        if self._upload_chunk(final=force) is not False:
    File "/code/.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
        return sync(self.loop, func, *args, **kwargs)
    File "/code/.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
        raise return_result
    File "/code/.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
        result[0] = await coro
    File "/code/.venv/lib/python3.10/site-packages/adlfs/spec.py", line 2068, in _async_upload_chunk
        await bc.commit_block_list(
    File "/code/.venv/lib/python3.10/site-packages/azure/core/tracing/decorator_async.py", line 77, in wrapper_use_tracer
        return await func(*args, **kwargs)
    File "/code/.venv/lib/python3.10/site-packages/azure/storage/blob/aio/_blob_client_async.py", line 1861, in commit_block_list
        process_storage_error(error)
    File "/code/.venv/lib/python3.10/site-packages/azure/storage/blob/_shared/response_handlers.py", line 184, in process_storage_error
        exec("raise error from None")   # pylint: disable=exec-used # nosec
    File "<string>", line 1, in <module>
    
azure.core.exceptions.HttpResponseError: The specified block list is invalid.
RequestId:<request_id>
Time:2024-02-13T12:15:05.1957595Z
ErrorCode:InvalidBlockList
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidBlockList</Code><Message>The specified block list is invalid.

From our limited investigation, this seems likely to be caused by the way AzureBlobFile calculates the IDs of the uploaded blocks:

adlfs/adlfs/spec.py

Lines 2102 to 2103 in 576fb7a

block_id = len(self._block_list)
block_id = f"{block_id:07d}"

Could the ID instead be derived from the actual contents of the uploaded block, for example a hash of the content or something similar?
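
To make the question concrete, here is a minimal sketch of what a content-derived block ID could look like. It is not adlfs code: `content_block_id` and `ID_LENGTH` are hypothetical names, and truncated SHA-256 is an arbitrary choice. The digest is cut to a fixed width because Azure requires every block ID within one blob to have the same length.

```python
import hashlib

# Hypothetical fixed width; all block IDs in one blob must share a length.
ID_LENGTH = 32


def content_block_id(chunk: bytes) -> str:
    """Return a fixed-length hex ID derived only from the chunk's bytes."""
    return hashlib.sha256(chunk).hexdigest()[:ID_LENGTH]


# Identical chunks map to the same ID; different chunks get different IDs.
chunks = [b"first chunk", b"second chunk", b"first chunk"]
ids = [content_block_id(c) for c in chunks]
```

With this scheme, two writers that stage different bytes would no longer reuse each other's IDs. Whether that alone is enough to avoid the InvalidBlockList error has not been verified here.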

cmp0xff commented Sep 5, 2024

Hi, it seems to me that we can do this:

  • Import shake_128 from hashlib and, in class AzureBlobFile, create
      def _block_id(self, block_list: list[str] | None = None):
          if block_list is None:
              block_list = self._block_list
    
          return shake_128(str(block_list).encode()).hexdigest(4)[:-1]
  • In adlfs/adlfs/spec.py, lines 2102 to 2103 in 576fb7a, replace

        block_id = len(self._block_list)
        block_id = f"{block_id:07d}"

    with

        block_id = self._block_id()
  • In adlfs/adlfs/spec.py, lines 2116 to 2117 in 576fb7a, replace

        block_id = len(self._block_list)
        block_id = f"{block_id:07d}"

    with

        block_id = self._block_id()
  • Finally, replace

        if block_id == "0000000" and length == 0 and final:

    with

        if block_id == self._block_id([]) and length == 0 and final:
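
The helper proposed above can be tried outside adlfs with a standalone sketch. `block_id` below uses the same expression as `_block_id` without `self`. `simulate_writer` is a hypothetical stand-in for one file object, and it assumes `_block_list` only ever holds previously generated IDs.

```python
from hashlib import shake_128


def block_id(block_list: list[str]) -> str:
    # Same expression as the proposed _block_id, without self:
    # a 4-byte digest is 8 hex characters, and [:-1] trims it to 7.
    return shake_128(str(block_list).encode()).hexdigest(4)[:-1]


def simulate_writer(n_blocks: int) -> list[str]:
    """Hypothetical writer: each new ID is derived from the IDs before it."""
    ids: list[str] = []
    for _ in range(n_blocks):
        ids.append(block_id(ids))
    return ids


# Two independent writers, each starting from an empty block list.
writer_a = simulate_writer(3)
writer_b = simulate_writer(3)
```

Each ID is 7 hex characters, the same width as the original zero-padded counter. Under the stated assumption, the ID depends only on the IDs generated before it, so `writer_a` and `writer_b` end up with identical sequences. It may be worth checking whether this on its own separates parallel uploads.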
