Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove operations pull completed tokens from txn-queue #301

Open
wants to merge 9 commits into
base: development
Choose a base branch
from

Conversation

jameinel
Copy link

@jameinel jameinel commented Nov 19, 2018

We ran into an odd edge case in our system, where we had a document that was being created and then removed repeatedly. (People were testing if the system worked by creating an object and then immediately removing it.)

The code for Insert and Remove just copies whatever the txn-queue was from the collection to the stash and back. This changes it so that both clean up the txn-queue while it is creating the document in the stash. This should be functionally equivalent to the $pullAll that we do during Update. It should be safe to do this because we are expressly creating the document out of 'whole cloth' with an Insert, so it isn't possible for us to be racing with another thread that is trying to add another txn to the queue. (If it was, this doesn't change the race that we would have already had).

There is an extra pass over the freshly read txn-queue, but in well behaved systems, the queue should always be kept small. This is mostly a case where we found we weren't behaving well.

We are directly inserting the Queue, but instead of just blindly copying
what was in the old document, apply the $pullAll that we would have
applied if this were an Update. Because we are creating the stash
document from scratch, we don't need to ask Mongo to do the work.
Essentially, we don't have to do both Insert and Remove, because one of
them must have been done in order to do the other.

This way, a series of only Insert and Remove operations won't cause the
txn-queue to get out of hand.
@jameinel jameinel changed the title Txn globalsign remove prunes queue Remove operations Pull completed tokens from txn-queue Nov 19, 2018
@jameinel jameinel changed the title Remove operations Pull completed tokens from txn-queue Remove operations pull completed tokens from txn-queue Nov 19, 2018
@jameinel
Copy link
Author

I should also note that I added a -fast flag to the txn test suite, as the StashStressTest takes about 2minutes for me, while the rest of the test suite takes something like 15s total time.

@jameinel
Copy link
Author

hm... I just had a failure on:

FAIL: sim_test.go:110: S.TestSimReinsertCopy4Workers

sim_test.go:308:
c.Check(err, IsNil)
... value *errors.errorString = &errors.errorString{s:"cannot find transaction 5bf2c04f93538f4ab865a02b_3acb2035 in queue for document {accounts 9}"} ("cannot find transaction 5bf2c04f93538f4ab865a02b_3acb2035 in queue for document {accounts 9}")

I'll investigate as that is the sort of race this change could have introduced.

It turns out that I missed we had just reread the document as part of
the 'add txn-insert' or 'add txn-remove' to the document. That gives us
an updated txn-queue that needs to be handled. Since we are now cleaning
directly on the txn-queue instead of the cached queue docs, go ahead and
do it on both Insert and Remove.
@jameinel
Copy link
Author

To follow up, I had missed that the insert that flags the item as marked for removal/marked for insert rereads the txn-queue, and that is the queue that we were using for the stash value. Not the queue that we started the function with. But that should be handled, and I ran
go test -fast -count=20
without any failures in the simulation tests, so I'm reasonably sure this is correct.

manadart
manadart previously approved these changes Nov 19, 2018
@@ -998,6 +1004,23 @@ func tokensToPull(dqueue []tokenAndId, pull map[bson.ObjectId]*transaction, dont
return result
}

// cleanQueue takes an existing queue and removes all the items in pull from it, generating a new txn-queue.
func cleanQueue(queue []token, pull map[bson.ObjectId]*transaction, dontPull token) ([]token) {
cleaned := make([]token, 0, len(queue))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When filtering like this, a nice trick I like is:
cleaned := queue[:0]
Because the slice is created with the same underlying array, there's no re-allocation up to the original length.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost went this way, but ended up avoiding it because it modifies the original object. We do use it for debug logging later on, so I didn't want to confuse the situation, especially since it doesn't rewrite info.Queue's length, it means an array that could end up with the tail doubled.

txn/flusher.go Outdated
@@ -998,6 +1004,23 @@ func tokensToPull(dqueue []tokenAndId, pull map[bson.ObjectId]*transaction, dont
return result
}

// cleanQueue takes an existing queue and removes all the items in pull from it, generating a new txn-queue.
func cleanQueue(queue []token, pull map[bson.ObjectId]*transaction, dontPull token) ([]token) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is filterQueue a better name for this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works for me. (done)

txn/txn_test.go Outdated
var qdoc txnQueue
err = s.accounts.FindId(0).One(&qdoc)
c.Assert(err, IsNil)
err = s.accounts.FindId(0).One(&qdoc)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there are reason to find the doc twice? Or is this an error?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a copy paste error in the test, fixed.

jameinel added a commit to jameinel/juju that referenced this pull request Nov 20, 2018
This is the patch from
  juju/mgo#4
  globalsign/mgo#301

It filters the txn-queue when copying a document from or to the stash,
just like we would do during a normal Update operation. This prevents a
sequence of Insert + Remove pairs from filling up the txn-queue.
jujubot added a commit to juju/juju that referenced this pull request Nov 20, 2018
#9486

## Description of change

This is the patch from
 juju/mgo#4
 globalsign/mgo#301

It filters the txn-queue when copying a document from or to the stash,
just like we would do during a normal Update operation. This prevents a
sequence of Insert + Remove pairs from filling up the txn-queue.

## QA steps

The associated test shows the basic change. To trigger this manually you can:

```
$ juju bootstrap lxd
$ while true; do juju deploy cs:~jameinel/ubuntu-lite; sleep 90; juju remove-application ubuntu-lite; sleep 30 done
```

While doing that, you should be able to go into the juju controller and either
```
> db.constraints.find()
> db.txns.stash.find({"_id.c": "constraints"})
```
And see that there is a document whose txn-queue is growing quite large. With this patch, the txn-queue should stay around 1 or 2 entries.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1804197
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants