firestore.BulkWriter.flush() has a race #1772

Open
thewildllama opened this issue Jul 31, 2024 · 4 comments · May be fixed by #1778
Labels: api: firestore Issues related to the googleapis/java-firestore API.
@thewildllama

bulkWriter.flush() (and therefore bulkWriter.close()) has a race when checking whether there are more items to flush.

We noticed this in our Cloud Function application that tries to write 30,000 records to a collection.

I have a fix that is working here:
https://github.com/thewildllama/java-firestore/tree/bulk-writer-flush-fix

I detailed the problem in my commit message:

Flush checks the size of the current batch to see if it still has work,
but a race can happen if pendingOpsCount > maxPendingOpCount and
operations have been placed in bufferedOperations.

When flush is called, scheduleCurrentBatchLocked() used to be called
with flush=true, which would call sendBatchLocked() with
flush=true as well. If the call to sendBatchLocked() did not get
rate limited, it would add a listener to the actual batch write that
would call scheduleCurrentBatchLocked() again if flush == true.

If the current batch's size is 0, scheduleCurrentBatchLocked() simply
returns, ending the flush.

If pendingOpsCount > maxPendingOpCount and operations are being buffered into
bufferedOperations instead of added to a batch, there is a chance that
scheduleCurrentBatchLocked() will execute before any pending operations
complete, while the current batch is still empty, prematurely ending the "flush".

This fix's approach avoids that with a global flush state and a flush
"endpoint" that marks the last operation that needs to be flushed. This
allows flush() to be called more than once (including a terminal call to
close() if it came to that).

We will end up calling scheduleCurrentBatchLocked() more often from
sendBatchLocked(), which is done via callback, so there will be some added
overhead there. When we did 30,000 inserts in quick succession, many of the
calls to sendBatchLocked() were coming from the rate limit backoff branch,
and those all had flush=false.
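
To make the "flush endpoint" idea concrete, here is a minimal, standalone sketch (not the actual BulkWriter internals; the class and method names are hypothetical): every enqueued operation's future is remembered, and flush() returns a future that completes only once every operation enqueued before the flush has settled, regardless of whether the current batch happens to be empty at that moment.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class FlushEndpointWriter {
    private final List<CompletableFuture<Void>> pendingOps = new ArrayList<>();

    synchronized CompletableFuture<Void> enqueue(Runnable op) {
        CompletableFuture<Void> f = CompletableFuture.runAsync(op);
        pendingOps.add(f);
        return f;
    }

    // The flush "endpoint" is the set of operations enqueued so far; operations
    // enqueued after this call do not extend this flush, and flush() can be
    // called again later (including a terminal call from close()).
    synchronized CompletableFuture<Void> flush() {
        return CompletableFuture.allOf(pendingOps.toArray(new CompletableFuture[0]));
    }
}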

Environment details

OS type and version: Linux/macOS
Java version: 21
Version(s): com.google.cloud:google-cloud-firestore:3.22.0, com.google.cloud:google-cloud-firestore:3.24.2

Steps to reproduce

  1. Insert 30,000 items into a collection within a try-with-resources block
  2. The application hangs when exiting the try-with-resources block

Code example

@JsonIgnoreProperties(ignoreUnknown = true)
public record HistoricalReference(
    @JsonProperty("Identifier")
    String identifier,
    @JsonProperty("Start date")
    LocalDate startDate
    // ...
) { }

try (BulkWriter bulkWriter = firestore.bulkWriter()) {
    historicalReferences.forEach(
        historicalReference -> {
            DocumentReference docRef = collectionReference.document(historicalReference.identifier());
            bulkWriter.set(
                docRef,
                // Can't wait for record support!
                OBJECT_MAPPER.convertValue(historicalReference, new TypeReference<>() {})
            );
        }
    );
}
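
While waiting on a fix, one possible mitigation (just a sketch, not something verified against this bug) is to flush in bounded chunks so operations never pile up in the internal buffer past the point where they start being deferred. It reuses the firestore, collectionReference, and OBJECT_MAPPER from the example above; the chunk size of 500 is arbitrary.

void writeInChunks(List<HistoricalReference> historicalReferences) throws Exception {
    try (BulkWriter bulkWriter = firestore.bulkWriter()) {
        int enqueued = 0;
        for (HistoricalReference historicalReference : historicalReferences) {
            DocumentReference docRef =
                collectionReference.document(historicalReference.identifier());
            bulkWriter.set(
                docRef,
                OBJECT_MAPPER.convertValue(
                    historicalReference, new TypeReference<Map<String, Object>>() {}));
            if (++enqueued % 500 == 0) {
                // Wait for this chunk to settle before enqueueing more work.
                bulkWriter.flush().get();
            }
        }
    } // close() flushes whatever remains
}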
product-auto-label bot added the api: firestore label on Aug 1, 2024
milaGGL self-assigned this on Aug 2, 2024
tom-andersen linked pull request #1778 on Aug 3, 2024 that will close this issue
@thewildllama
Author

@tom-andersen - Nice fix! Way cooler than mine. :)

@thewildllama
Author

@tom-andersen & @milaGGL - I've been testing with the fix for a few days, and I still see a few problems. The one I've been running down is difficult for me to understand, but here's what I've got so far:

We create somewhere between 15,000 and 30,000 write operations. Occasionally, something will happen that will prevent some of the earlier operations from making their way into a BulkCommitBatch until the very end of the batch handling. The chain of futures leading up to lastFlushOperation gets very long.

In one case I have logs for, I see the 819th through 837th operations not get into a batch until the final operation, in this case the 14,530th. When those operations finally run and finish, they are the last, and at that point every operation has executed successfully. The futures chain starts propagating success up towards lastFlushOperation, but everything falls off the rails somewhere around the 13,174th operation. At that point, propagation stops, and lastFlushOperation never completes.

This does not happen every time, and I have varying streaks of lucky and unlucky runs.
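
For anyone trying to reproduce this, one way to get per-operation logging like the above is to attach callbacks to the futures returned by bulkWriter.set(). A minimal sketch (not the exact code behind the logs above; the class name, counter, and use of a direct executor are just illustrative):

import com.google.api.core.ApiFuture;
import com.google.api.core.ApiFutureCallback;
import com.google.api.core.ApiFutures;
import com.google.cloud.firestore.WriteResult;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.atomic.AtomicInteger;

class BulkWriterProbe {
    private final AtomicInteger completed = new AtomicInteger();

    // Log when a single BulkWriter operation settles, so it is visible which
    // operations finish and roughly where success propagation stalls.
    void track(int opIndex, ApiFuture<WriteResult> op) {
        ApiFutures.addCallback(op, new ApiFutureCallback<WriteResult>() {
            @Override
            public void onSuccess(WriteResult result) {
                System.out.println("op " + opIndex + " done (" + completed.incrementAndGet() + " completed)");
            }

            @Override
            public void onFailure(Throwable t) {
                System.out.println("op " + opIndex + " failed: " + t);
            }
        }, MoreExecutors.directExecutor());
    }
}

Each bulkWriter.set(...) call returns an ApiFuture<WriteResult> that can be passed to track() along with its enqueue index.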

@milaGGL
Contributor

milaGGL commented Aug 8, 2024

Is this happening with both fixes?

@thewildllama
Author

I only saw "fix: BulkWriter flush logic #1778"... is there another? Or you mean my attempt? I saw things getting stuck with mine, too.
