fix: handle race condition on cache retention #569

Open · wants to merge 3 commits into main
@@ -98,11 +98,16 @@ public RemovalListener<ChunkKey, Path> removalListener() {
         return (key, path, cause) -> {
             try {
                 if (path != null) {
-                    final long fileSize = Files.size(path);
-                    Files.delete(path);
-                    metrics.chunkDeleted(fileSize);
-                    log.trace("Deleted cached file for key {} with path {} from cache directory."
-                        + " The reason of the deletion is {}", key, path, cause);
+                    if (Files.exists(path)) {
Contributor:

I'm still failing to understand how it is possible that path will be null or the file will not exist. I think it should not be the case; otherwise something is quite wrong, IMO.

Contributor Author:

I agree it's a weird case.

The RemovalListener Javadoc states:

> An instance may be called concurrently by multiple threads to process different entries.

This confirms that it could be a race condition between separate threads calling the removal. I also find it weird that, when there is a winning thread, the metric is not increased.

Also:

> Implementations of this interface should avoid performing blocking calls or synchronizing on shared resources.

We currently already handle the case where path/value is null by logging an error. This change can be seen as an extension of that handling to the scenario where the referenced file is not found.
We still log this so we can troubleshoot if/when it happens.
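To make the quoted concurrency contract concrete, here is a small standalone demo (illustrative only, not part of this PR; all names are made up) of the suspected interleaving: both threads can pass the Files.exists check before either delete runs, so the losing thread's Files.delete throws NoSuchFileException. Being a race, it will not reproduce on every run.

```java
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DeleteRaceDemo {
    public static void main(final String[] args) throws Exception {
        final Path path = Files.createTempFile("chunk", ".cache");
        final CyclicBarrier barrier = new CyclicBarrier(2);
        final Runnable deleter = () -> {
            final String name = Thread.currentThread().getName();
            try {
                barrier.await();          // line both threads up on the same file
                if (Files.exists(path)) { // both threads can observe "exists"
                    Files.delete(path);   // but only one delete can succeed
                    System.out.println(name + " deleted the file");
                }
            } catch (final NoSuchFileException e) {
                System.out.println(name + " lost the race: " + e);
            } catch (final Exception e) {
                throw new RuntimeException(e);
            }
        };
        final ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(deleter);
        pool.submit(deleter);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```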

+                        final long fileSize = Files.size(path);
+                        Files.delete(path);
Comment on lines +101 to +103
Contributor:

Considering the racy nature of things here, a file may be deleted between the Files.exists check and Files.delete. I think we should instead try to delete without a check and catch the "file not found" exception. "It is better to ask forgiveness than permission" :)
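A minimal sketch of that suggestion, keeping the listener shape from this hunk (metrics and log are the surrounding class's fields; this is not the code in the PR): Files.deleteIfExists collapses the exists/delete pair into a single call and reports whether this thread actually removed the file, while catching NoSuchFileException covers the file vanishing even before Files.size.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

import com.github.benmanes.caffeine.cache.RemovalListener;

public RemovalListener<ChunkKey, Path> removalListener() {
    return (key, path, cause) -> {
        try {
            if (path != null) {
                // May throw NoSuchFileException if the file is already gone.
                final long fileSize = Files.size(path);
                if (Files.deleteIfExists(path)) {
                    // Only the thread whose delete succeeded ticks the metric.
                    metrics.chunkDeleted(fileSize);
                    log.trace("Deleted cached file for key {} with path {} from cache directory."
                        + " The reason of the deletion is {}", key, path, cause);
                } else {
                    log.debug("Cached file for key {} with path {} was already deleted", key, path);
                }
            } else {
                log.warn("Path not present when trying to delete cached file for key {} from cache directory."
                    + " The reason of the deletion is {}", key, cause);
            }
        } catch (final NoSuchFileException e) {
            log.debug("Cached file for key {} with path {} disappeared before deletion", key, path);
        } catch (final IOException e) {
            log.error("Failed to delete cached file for key {} with path {}", key, path, e);
        }
    };
}
```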

+                        metrics.chunkDeleted(fileSize);
+                        log.trace("Deleted cached file for key {} with path {} from cache directory."
+                            + " The reason of the deletion is {}", key, path, cause);
+                    } else {
+                        log.debug("Deletion of cached file for key {} with "
+                            + "path {} is requested but file is not found", key, path);
+                    }
                 } else {
                     log.warn("Path not present when trying to delete cached file for key {} from cache directory."
                         + " The reason of the deletion is {}", key, cause);
@@ -89,7 +89,8 @@ void metrics() throws IOException, JMException, StorageBackendException {
final DiskChunkCache diskChunkCache = new DiskChunkCache(chunkManager, time);
diskChunkCache.configure(Map.of(
"size", size1, // enough to put the first, but not both
"path", baseCachePath.toString()
"path", baseCachePath.toString(),
"retention.ms", String.valueOf(Duration.ofSeconds(10).toMillis())
Contributor:

I believe the idea of the test was different. It was intended to test that only one chunk is deleted because the cache size reached the limit, and because of Window TinyLfu it is not possible to say whether the first or the second chunk will be deleted. By specifying retention.ms, I believe we lose the case where the metrics are reported correctly for a single chunk, since there is a high chance that both will be deleted.

Contributor Author:

We have seen failing test cases (unfortunately too old to reference) that do not depend on time, meaning that even increasing the timeout beyond 30s would not help: the eviction had already failed (for some reason; it could be the file-not-found exception), so the metrics never tick.
This test is not meant to check that deletion works properly (I think that is out of the scope of this test), but to check that the deletion metric is reported.
So expanding the cache configuration to trigger deletion in a more consistent way seems like a fair trade to remove the flakiness. Otherwise we wait for the flaky test to trigger and hope to get enough logs to troubleshoot, which has not been the case so far: CI logs only show "test failed" without including the logs (maybe something else to fix?).
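For context on what retention.ms changes, a hedged sketch of how such a setting commonly maps onto Caffeine (an assumption for illustration; this diff does not show how DiskChunkCache builds its cache, and chunkSize is a hypothetical helper returning an int weight): size-based eviction picks victims via Window TinyLfu, while expireAfterWrite plus a scheduler makes time-based deletions fire promptly enough for the await() below.

```java
import java.nio.file.Path;
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Scheduler;

Cache<ChunkKey, Path> buildCache(final long maxSizeBytes, final long retentionMs) {
    return Caffeine.newBuilder()
        // Size bound: Window TinyLfu decides which entry is evicted first.
        .maximumWeight(maxSizeBytes)
        .weigher((ChunkKey key, Path value) -> chunkSize(value))
        // Time bound: entries older than retention.ms are evicted...
        .expireAfterWrite(Duration.ofMillis(retentionMs))
        // ...and a scheduler removes them promptly rather than waiting
        // for the next cache access, which the awaiting test relies on.
        .scheduler(Scheduler.systemScheduler())
        .removalListener(removalListener())
        .build();
}
```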

));

diskChunkCache.getChunk(OBJECT_KEY_PATH, SEGMENT_MANIFEST, 0);
@@ -118,28 +119,40 @@ void metrics() throws IOException, JMException, StorageBackendException {
assertThat(MBEAN_SERVER.getAttribute(objectName, "write-bytes-rate"))
.isEqualTo(((double) (size1 + size2)) / METRIC_TIME_WINDOW_SEC);

+        // because of the retention.ms, it may delete cached value 1, 2, or both.
         await("Deletion happens")
-            .atMost(Duration.ofSeconds(30)) // increase to reduce chance of flakiness
+            .atMost(Duration.ofSeconds(30))
.pollDelay(Duration.ofMillis(100))
.pollInterval(Duration.ofMillis(100))
.until(() -> (double) MBEAN_SERVER.getAttribute(objectName, "delete-total") > 0);

assertThat(MBEAN_SERVER.getAttribute(objectName, "delete-total"))
-            .isEqualTo(1.0);
+            .asInstanceOf(DOUBLE)
+            .satisfiesAnyOf(
+                deleteTotal -> assertThat(deleteTotal).isEqualTo(1),
+                deleteTotal -> assertThat(deleteTotal).isEqualTo(2)
+            );
assertThat(MBEAN_SERVER.getAttribute(objectName, "delete-rate"))
-            .isEqualTo(1.0 / METRIC_TIME_WINDOW_SEC);
+            .satisfiesAnyOf(
+                deleteTotalRate -> assertThat(deleteTotalRate).isEqualTo(1.0 / METRIC_TIME_WINDOW_SEC),
+                deleteTotalRate -> assertThat(deleteTotalRate).isEqualTo(2.0 / METRIC_TIME_WINDOW_SEC)
+            );

assertThat(MBEAN_SERVER.getAttribute(objectName, "delete-bytes-total"))
.asInstanceOf(DOUBLE)
.satisfiesAnyOf(
-                deleteBytesTotal -> assertThat(deleteBytesTotal).asInstanceOf(DOUBLE).isEqualTo(size1),
-                deleteBytesTotal -> assertThat(deleteBytesTotal).asInstanceOf(DOUBLE).isEqualTo(size2)
+                deleteBytesTotal -> assertThat(deleteBytesTotal).isEqualTo(size1),
+                deleteBytesTotal -> assertThat(deleteBytesTotal).isEqualTo(size2),
+                deleteBytesTotal -> assertThat(deleteBytesTotal).isEqualTo(size1 + size2)
);
assertThat(MBEAN_SERVER.getAttribute(objectName, "delete-bytes-rate"))
.satisfiesAnyOf(
deleteBytesRate -> assertThat(deleteBytesRate)
.isEqualTo((double) size1 / METRIC_TIME_WINDOW_SEC),
deleteBytesRate -> assertThat(deleteBytesRate)
-                    .isEqualTo((double) size2 / METRIC_TIME_WINDOW_SEC)
+                    .isEqualTo((double) size2 / METRIC_TIME_WINDOW_SEC),
+                deleteBytesRate -> assertThat(deleteBytesRate)
+                    .isEqualTo((double) (size1 + size2) / METRIC_TIME_WINDOW_SEC)
);
}
}
@@ -215,7 +215,7 @@ void timeBasedEviction() throws IOException, StorageBackendException, Interrupte
void sizeBasedEviction() throws IOException, StorageBackendException {
cache.configure(Map.of(
"size", "18",
"retention.ms", "-1"
"retention.ms", String.valueOf(Duration.ofSeconds(10).toMillis())
));
assertThat(cache.cache.asMap()).isEmpty();

@@ -253,13 +253,14 @@ void sizeBasedEviction() throws IOException, StorageBackendException {
assertThat(timeIndex).hasBinaryContent(TIME_INDEX);
assertThat(cache.cache.asMap()).isNotEmpty();

+        // because of the retention.ms, it may delete cached value 1, 2, or both.
         await()
-            .atMost(Duration.ofSeconds(30)) // increase to reduce chance of flakiness
+            .atMost(Duration.ofSeconds(30))
.pollDelay(Duration.ofSeconds(2))
.pollInterval(Duration.ofMillis(10))
.until(() -> !mockingDetails(removalListener).getInvocations().isEmpty());

-        assertThat(cache.cache.asMap()).hasSize(1);
+        assertThat(cache.cache.asMap().size()).isLessThanOrEqualTo(1);
verify(removalListener).onRemoval(any(SegmentIndexKey.class), any(), eq(RemovalCause.SIZE));
}
}